Abstract

Regularized kernel methods such as support vector machines (SVM)
and support vector regression (SVR) constitute a broad and flexible
class of methods which are theoretically well investigated and commonly
used in nonparametric classification and regression problems.
As these methods are based on a Tikhonov regularization, which is also
common in inverse problems, this article investigates the use of
regularized kernel methods for inverse problems in a unifying way.
Regularized kernel methods are based on the use of reproducing kernel
Hilbert spaces (RKHS), which lead to very good computational properties.
It is shown that similar properties remain true in solving statistical
inverse problems and that standard software implementations developed
for ordinary regression problems can still be used for inverse regression
problems.

Consistency of these methods and a rate of convergence for the risk
are shown under quite weak assumptions, and rates of convergence for
the estimator are shown under somewhat stronger assumptions. The
applicability of these methods is demonstrated in a simulation.

1 Introduction 

One of the most important statistical inverse problems is the inverse
regression problem in which one observes i.i.d. data
$(z_1, y_1), \ldots, (z_n, y_n)$ from the model

$$Y = (Af_0)(Z) + s(Z)\varepsilon \tag{1}$$

in which $A$ is a (known) linear operator between suitable function spaces,
$Af_0 : \mathcal{Z} \to \mathbb{R}$ is the (unknown) regression function, and
$s : \mathcal{Z} \to \mathbb{R}$ is an (unknown) scale function. The goal is to
estimate the primary function $f_0 : \mathcal{X} \to \mathbb{R}$. If $A$ has a
bounded inverse $A^{-1}$, then $f_0$ can simply be estimated by $A^{-1} g_n$
where $g_n$ is an ordinary estimate of the regression function $g = Af_0$.
However, in a typical inverse regression problem, $A$ does not have a bounded
inverse, so that one is faced with an ill-posed problem which has to be dealt
with in different and much more complicated ways. See, e.g., Cavalier (2011)
for an overview. Two of the most common types of estimators in inverse
regression problems are spectral cut-off estimators and Tikhonov estimators.
In case of the spectral cut-off estimator, it is assumed that $A$ is an injective
compact operator between $L_2$-spaces $L_2(\mu)$ and $L_2(P_Z)$ so that a singular
value decomposition exists. That is, there are a complete orthonormal system
$(v_j)_{j \in \mathbb{N}}$ of $L_2(\mu)$, an orthonormal system
$(u_j)_{j \in \mathbb{N}}$ in $L_2(P_Z)$, and singular values
$(\sigma_j)_{j \in \mathbb{N}} \subset (0, \infty)$ such that
$Av_j = \sigma_j u_j$ and $A^* u_j = \sigma_j v_j$ where $A^*$
denotes the adjoint operator of $A$. Then, the spectral cut-off estimator is
given by

$$f_n = \sum_{j=1}^{J} \frac{b_j}{\sigma_j}\, v_j \qquad \text{with} \qquad b_j = \frac{1}{n} \sum_{i=1}^{n} u_j(z_i)\, y_i$$

where the truncation parameter $J = J_n$ acts as a regularization parameter.
This estimator is investigated in many articles on inverse regression
problems, e.g., in van Rooij and Ruymgaart (1996), Mair and Ruymgaart (1996),
Bauer and Munk (2007), Bissantz and Holzmann (2008), and Bissantz and
Birke (2009). A disadvantage of this estimator is that one needs to know
the singular value decomposition of $A$ in order to compute the estimator
(this is also a considerable limitation for general software implementations).
Furthermore, one usually has to know the distribution $P_Z$ of the covariate
$Z$. Most articles on spectral cut-off estimators assume that $Z$ is uniformly
distributed on $[0, 1]$ or use equidistant design points.

Methods based on Tikhonov regularization are common in non-stochastic
as well as in statistical settings of inverse problems. In inverse regression
problems, the estimator based on Tikhonov regularization is the minimizer

$$\arg\min_{f \in H} \left( \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (Af)(z_i) \big)^2 + \lambda \|f\|_H^2 \right) \tag{2}$$

where $H$ is a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. Such an
estimator has been considered, e.g., in O'Sullivan (1986), Mathé and
Pereverzev (2001), Bissantz et al. (2007), and Cavalier (2008).
Most articles on Tikhonov regularization in statistical inverse problems focus
on rates of convergence, whereas simulations or applications to real data sets
are rare. One reason might be that calculating the estimator is, in general,
not an easy task and suitable software implementations (e.g., as R packages)
are still largely missing. The situation improves if reproducing kernel Hilbert
spaces (RKHS) are chosen for $H$ in (2). These Hilbert spaces have excellent
properties from a computational point of view and have therefore recently
attracted much attention in statistics, machine learning, and approximation
theory. Tikhonov estimators (2) in an RKHS are already used in the early work
of Wahba (1977) and Wahba (1980); there, a special case of model (1) is
considered in which the error is homoscedastic (i.e., $s \equiv 1$),
$\mathcal{X} = \mathcal{Z} = [0, 1]$, the data $z_i$, $i \in \{1, \ldots, n\}$,
are equidistant design points, and $A$ is a Fredholm integral operator of the
first kind, that is,

$$(Af)(z) = \int_{[0,1]} K(x, z)\, f(x)\, \lambda(dx) \qquad \forall\, z \in [0, 1],\ f \in H.$$

A similar setting is also considered in Nychka and Cox (1989), which shows
rates of convergence for Tikhonov estimators (2) in an RKHS. In ordinary
nonparametric classification and regression problems, a great deal of research
has been done during the last decade on methods based on RKHS such as support
vector machines (SVM) and support vector regression (SVR). These
methods belong to a broad class of methods called regularized kernel methods.
In an ordinary (heteroscedastic) regression problem

$$Y = f_0(X) + s(X)\varepsilon, \tag{3}$$

the estimate of a regularized kernel method is the minimizer

$$\arg\min_{f \in H} \left( \frac{1}{n} \sum_{i=1}^{n} L\big(x_i, y_i, f(x_i)\big) + \lambda \|f\|_H^2 \right) \tag{4}$$

where $H$ is an RKHS and $L$ is a suitable loss function; see, e.g.,
Vapnik (1998), Schölkopf and Smola (2002), and Steinwart and Christmann (2008).
In view of (2), regularized kernel methods can also be defined for inverse
regression problems in an obvious way by

$$\arg\min_{f \in H} \left( \frac{1}{n} \sum_{i=1}^{n} L\big(z_i, y_i, (Af)(z_i)\big) + \lambda \|f\|_H^2 \right). \tag{5}$$

The goal of the present article is to investigate these methods (which
considerably generalize the setting of the early work in Wahba (1977), Wahba
(1980), and Nychka and Cox (1989)) in a unifying way in the light of the
current state of research on regularized kernel methods for ordinary
regression problems (3). The results are not restricted to
$\mathcal{X} = \mathcal{Z} = [0, 1]$; $\mathcal{X}$ may be any compact subset of
$\mathbb{R}^d$ for any dimension $d \in \mathbb{N}$, and $\mathcal{Z}$ may be
any Polish space. Furthermore, we also consider heteroscedastic errors, as
homoscedasticity is less frequent in real data sets. Allowing for different loss
functions, on the one hand, extends the applicability of the method from
mean regression to tasks such as median regression, quantile regression, and
classification. On the other hand, it is well known that the least-squares
loss typically induces poor robustness properties and that regularized kernel
methods for Lipschitz-continuous loss functions (such as the absolute
deviation loss) have very good robustness properties; see Christmann and
Steinwart (2007), Christmann et al. (2009), and Hable and Christmann (2011).

Formerly, using the least-squares loss was simply a computational need; due
to the nice structure of the least-squares loss, calculating (2) is equivalent
to solving the equation

$$A^*Af + n\lambda f = A^*y \tag{6}$$

in $H$ and, for compact operators $A$ with singular system
$(\sigma_j; v_j, u_j)$, the solution is given by

$$f = \sum_{j} \frac{\sigma_j}{\sigma_j^2 + n\lambda}\, \langle y, u_j \rangle\, v_j;$$

see, e.g., (Engl et al., 1996, p. 117). If $H$ is an RKHS, solving (6) is much
easier but still involves calculating the inverse of an $n \times n$ matrix; see
(Wahba, 1977, p. 654). Nowadays, however, enormous efforts have been made to
develop powerful software implementations for calculating regularized kernel
methods (4) such as SVM and SVR for various loss functions.
It turns out (Theorem 2.3) that these implementations can still be used in
order to calculate (5), i.e., regularized kernel methods for inverse problems.
One only has to calculate a certain "pseudo" kernel matrix $M$ and proceed
with standard software implementations as if $M$ were the kernel matrix $K$
from an ordinary regularized kernel method; see Sections 2 and 4 for details.

In a non-stochastic setting, an RKHS in (2) has also been considered in
Krebs et al. (2009) for the least-squares loss and in Krebs (2011) for the
$\varepsilon$-insensitive loss, a common loss function in machine learning.
Also in a non-stochastic setting, Eggermont et al. (2012) consider an RKHS in
the Tikhonov method (with the least-squares loss) but in a quite different way:
there, the codomain of $A$ is a subset of an RKHS while we use an RKHS
as the domain of $A$.

The rest of the article is organized as follows: Section 2 contains notation,
assumptions, and the general definition of regularized kernel methods for
inverse problems. Furthermore, it is shown that the estimators uniquely exist
(Theorem 2.4) and that an analogue of the empirical representer theorem
holds, which makes it possible to use standard software implementations
developed for ordinary regularized kernel methods (Theorem 2.3). In Section 3,
consistency in the $H$-norm (which is stronger than the supremum-norm) and
a rate of convergence for the risk are shown under quite weak assumptions
(Theorem 3.1 and Theorem 3.2). A rate of convergence of the estimator
in the $H$-norm is shown under somewhat stronger assumptions (Theorem 3.3).
Section 4 contains a simulation in a standard example, namely the
heat equation. All proofs are deferred to the appendix; Subsection 6.1 in
the appendix also contains a number of additional results which are needed
in the proofs of the main results and are of interest in their own right.
2 Regularized Kernel Methods in Inverse Problems



Throughout the whole article, we deal with the general setting summarized 
in the following assumption: 

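Assumption 2.1 Let $P$ be a probability measure on
$(\mathcal{Z} \times \mathcal{Y}, \mathfrak{B}_{\mathcal{Z} \times \mathcal{Y}})$ where
$\mathcal{Z}$ is a Polish space, $\mathcal{Y} \subset \mathbb{R}$ is closed, and
$\mathfrak{B}_{\mathcal{Z} \times \mathcal{Y}}$ is the Borel-$\sigma$-algebra
on $\mathcal{Z} \times \mathcal{Y}$. The marginal distribution of $P$ on
$\mathcal{Z}$ is denoted by $P_Z$; the corresponding conditional distribution
on $\mathcal{Y}$ given $z$ is denoted by $P(\cdot \,|\, z)$. That is,

$$\int g \, dP = \int\!\!\int g(z, y)\, P(dy \,|\, z)\, P_Z(dz) \qquad \forall\, g \in L_1(P).$$

Let $\mathcal{X} \subset \mathbb{R}^d$ be compact and let $k$ be a kernel on
$\mathcal{X}$ which is extendable to an $m$-times differentiable kernel on
$\mathbb{R}^d$ where $m > d/2$. The RKHS of $k$ is denoted by $H$. The operator

$$A : H \to C_b(\mathcal{Z}) \ \text{ is continuous and linear} \tag{7}$$

where $C_b(\mathcal{Z})$ denotes the Banach space of all bounded, continuous
functions $g : \mathcal{Z} \to \mathbb{R}$ with supremum-norm $\|\cdot\|_\infty$.

The function $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$
is a convex loss function; that is, the function $t \mapsto L(z, y, t)$ is
convex for every fixed $z \in \mathcal{Z}$, $y \in \mathcal{Y}$.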

In order to prove consistency and rates of convergence, Assumption (7) will
later be replaced by the stronger assumption that

$$A : C_b(\mathcal{X}) \to C_b(\mathcal{Z}) \ \text{ is continuous and linear.} \tag{8}$$

However, it is always made explicit whenever (8) is assumed instead
of (7). Note that Assumption (8) indeed implies (7) because, according to
(Steinwart and Christmann, 2008, Lemma 4.23),

$$\|f\|_\infty \le \|k\|_\infty \|f\|_H \qquad \forall\, f \in H. \tag{9}$$

Articles on inverse regression problems typically assume compactness of the
operator $A$, and it may seem as if we had no such assumption here. However,
$A$ enjoys a compactness property which comes for free in this setting. As
shown in Prop. 6.1, it follows from (8) that $A : H \to C_b(\mathcal{Z})$ is a
compact operator, but this compactness is a quite weak property. The reason is
that weak convergence in the RKHS of a bounded continuous kernel is a
relatively strong kind of convergence. In particular, weak convergence in $H$
implies pointwise convergence, which is an easy consequence of the so-called
reproducing property

$$\langle f, \Phi(x) \rangle_H = f(x) \qquad \forall\, x \in \mathcal{X},\ f \in H \tag{10}$$

where $\Phi$ denotes the canonical feature map of $H$, i.e.,
$\Phi(x) = k(x, \cdot)$ for every $x \in \mathcal{X}$.

In most parts of the article, we will also impose the following assumption 
on the loss function L: 

Assumption 2.2 Assume that $L$ is continuous and that there are
$b \in L_1(P)$, $c_L \in (0, \infty)$, and $\beta_L \in [1, 2]$ such that, for
every $z \in \mathcal{Z}$, $y \in \mathcal{Y}$, and $t \in \mathbb{R}$,

$$L(z, y, t) \le b(z, y) + c_L |t|^{\beta_L}. \tag{11}$$

In addition, assume that there are $p \in [0, 1]$, $b_0' \in L_2(P_Y)$ with
$b_0' \ge 0$, and $b_1' \in [0, \infty)$ such that, for every
$z \in \mathcal{Z}$, $y \in \mathcal{Y}$, $a \in (0, \infty)$, and
$t_1, t_2 \in [-a, a]$,

$$|L(z, y, t_1) - L(z, y, t_2)| \le \big( b_0'(y) + b_1' a^p \big) \cdot |t_1 - t_2|. \tag{12}$$

This assumption looks quite special but, in fact, covers all of the commonly
used loss functions: least squares, hinge, truncated least squares, and
logistic for classification; least squares, absolute distance, pinball,
epsilon-insensitive, Huber, and logistic for regression.
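
The Lipschitz-type condition (12) can be checked numerically for individual
regression losses. The following is a minimal Python sketch, not code from the
article (which refers to R); the function names, the parameter values
`tau`, `eps`, `c`, and the sampling scheme are assumptions made for
illustration. For the Lipschitz-continuous losses, (12) holds with total
factor $b_0'(y) + b_1' a^p = 1$ (taking $p = 0$); for the least-squares loss,
$|(y - t_1)^2 - (y - t_2)^2| = |t_1 - t_2| \cdot |2y - t_1 - t_2|$ gives
$p = 1$, $b_0'(y) = 2|y|$, and $b_1' = 2$.

```python
import numpy as np

def absolute_loss(y, t):
    return np.abs(y - t)

def pinball_loss(y, t, tau=0.3):
    r = y - t
    # tau * r for r >= 0, (1 - tau) * |r| for r < 0
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

def eps_insensitive_loss(y, t, eps=0.1):
    return np.maximum(np.abs(y - t) - eps, 0.0)

def huber_loss(y, t, c=1.0):
    r = np.abs(y - t)
    return np.where(r <= c, 0.5 * r ** 2, c * (r - 0.5 * c))

def least_squares_loss(y, t):
    return (y - t) ** 2

# Random test points with t1, t2 in [-a, a] (illustrative values only).
rng = np.random.default_rng(0)
a = 3.0
y = rng.normal(size=10_000)
t1 = rng.uniform(-a, a, size=y.size)
t2 = rng.uniform(-a, a, size=y.size)
gap = np.abs(t1 - t2)

# Losses with Lipschitz constant at most 1 (Huber with c = 1).
lipschitz_losses = {
    "absolute": absolute_loss,
    "pinball": pinball_loss,
    "eps_insensitive": eps_insensitive_loss,
    "huber": huber_loss,
}
lipschitz_ok = {
    name: bool(np.all(np.abs(loss(y, t1) - loss(y, t2)) <= gap + 1e-12))
    for name, loss in lipschitz_losses.items()
}

# Least squares: bound (2|y| + 2a) * |t1 - t2|, i.e. p = 1 in (12).
ls_diff = np.abs(least_squares_loss(y, t1) - least_squares_loss(y, t2))
ls_ok = bool(np.all(ls_diff <= (2.0 * np.abs(y) + 2.0 * a) * gap + 1e-9))
```

A random check of this kind cannot prove (12); it only illustrates which
constants are plausible for each loss.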

For $D_n = ((z_1, y_1), \ldots, (z_n, y_n)) \in (\mathcal{Z} \times \mathcal{Y})^n$
and $\lambda > 0$, define

$$f_{A, D_n, \lambda} = \arg\min_{f \in H}\ \frac{1}{n} \sum_{i=1}^{n} L\big(z_i, y_i, (Af)(z_i)\big) + \lambda \|f\|_H^2 \tag{13}$$

and the regularized empirical risk

$$\mathcal{R}_{A, D_n, \lambda}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(z_i, y_i, (Af)(z_i)\big) + \lambda \|f\|_H^2 \qquad \forall\, f \in H.$$

That is, $f_{A, D_n, \lambda} = \arg\min_{f \in H} \mathcal{R}_{A, D_n, \lambda}(f)$.
The following theorem is the analogue of the well-known representer theorem
for ordinary regularized kernel methods (i.e., $A = \mathrm{id}$); see, e.g.,
(Steinwart and Christmann, 2008, Theorem 5.5). By use of this theorem, the
optimization problem (13) in the infinite-dimensional function space $H$ can
be reduced to a convex optimization problem in $\mathbb{R}^n$. In similar but
non-stochastic inverse problems, corresponding results have already been
obtained for special loss functions in (Krebs et al., 2009, Lemma 3.1 and
Theorem 3.2) and (Krebs, 2011, Lemma 3.1).

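Theorem 2.3 (Empirical Representer Theorem)

Let Assumption 2.1 be fulfilled. For every
$D_n = ((z_1, y_1), \ldots, (z_n, y_n)) \in (\mathcal{Z} \times \mathcal{Y})^n$
and $\lambda > 0$, the function $f_{A, D_n, \lambda}$ defined by (13) uniquely
exists. There are $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ such that

$$f_{A, D_n, \lambda}(\cdot) = \sum_{i=1}^{n} \alpha_i\, \big(A\Phi(\cdot)\big)(z_i). \tag{14}$$

The matrix $M \in \mathbb{R}^{n \times n}$ defined by

$$M_{ij} = \Big( A \big[ \big(A\Phi(\cdot)\big)(z_i) \big] \Big)(z_j) \tag{15}$$

is symmetric and positive semi-definite. For every
$\alpha = (\alpha_1, \ldots, \alpha_n)^{\mathsf{T}} \in \mathbb{R}^n$,

$$\mathcal{R}_{A, D_n, \lambda}\Big( \sum_{i=1}^{n} \alpha_i \big(A\Phi(\cdot)\big)(z_i) \Big) = \frac{1}{n} \sum_{i=1}^{n} L\big(z_i, y_i, \alpha^{\mathsf{T}} M e_i\big) + \lambda\, \alpha^{\mathsf{T}} M \alpha \tag{16}$$

where $e_i$ denotes the $i$-th vector of the standard basis of $\mathbb{R}^n$.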



Theorem 2.3 is of great practical importance because it says that estimates
can be calculated essentially by finding a minimizer of

$$\alpha \ \mapsto\ \frac{1}{n} \sum_{i=1}^{n} L\big(z_i, y_i, \alpha^{\mathsf{T}} M e_i\big) + \lambda\, \alpha^{\mathsf{T}} M \alpha$$

in $\mathbb{R}^n$ where $M$ is a symmetric and positive semi-definite matrix.
This is extremely convenient because calculating ordinary regularized kernel
methods leads to an optimization problem with exactly the same structure.
Accordingly, developing new software for inverse problems is unnecessary
because, after calculating the matrix $M$, one can use standard software for
regularized kernel methods such as the R package "kernlab" (Karatzoglou
et al., 2004). In order to calculate $M$, one only needs to write an R function
which takes a function $f$ as an argument and returns the function $Af$.
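
To make the construction of $M$ in (15) concrete, the following is a minimal
sketch in Python rather than R. It assumes, purely for illustration, that $A$
is an integral operator $(Af)(z) = \int K(x, z) f(x)\, dx$ on $[0, 1]$ that is
discretized by a midpoint rule; the names `pseudo_kernel_matrix` and
`op_kernel`, the Gaussian blur used as `op_kernel`, and all numerical values
are hypothetical and not taken from the article. The final step uses the
least-squares loss only, for which one minimizer of the objective above is
$\alpha = (M + n\lambda I)^{-1} y$; other losses would require a solver for
regularized kernel methods that accepts a precomputed matrix.

```python
import numpy as np

def wendland_1_1(r):
    """Wendland function phi_{1,1}(r) = (1 - r)_+^3 (3 r + 1)."""
    r = np.asarray(r, dtype=float)
    return np.clip(1.0 - r, 0.0, None) ** 3 * (3.0 * r + 1.0)

def pseudo_kernel_matrix(z, grid, weights, op_kernel, k):
    """Quadrature approximation of M_ij = (A[(A Phi(.))(z_i)])(z_j).

    Assumption for this sketch: (Af)(z) = int op_kernel(x, z) f(x) dx,
    approximated on `grid` with quadrature `weights`.
    """
    # B[i, l] = w_l * op_kernel(x_l, z_i): discretized map f -> (Af)(z_i)
    B = op_kernel(grid[None, :], z[:, None]) * weights[None, :]
    K = k(grid[:, None], grid[None, :])      # kernel matrix on the grid
    G = B @ K        # G[i, l] = g_i(x_l) with g_i = (A Phi(.))(z_i)
    M = G @ B.T      # M[i, j] = (A g_i)(z_j)
    return M, G

# Hypothetical toy problem on X = Z = [0, 1].
m, n, lam = 400, 50, 1e-3
grid = (np.arange(m) + 0.5) / m              # midpoint-rule nodes
weights = np.full(m, 1.0 / m)
z = np.arange(n) / n                          # design points

def op_kernel(x, zz):
    return np.exp(-((x - zz) ** 2) / (2.0 * 0.05 ** 2))

def k(x, xp):
    return wendland_1_1(np.abs(x - xp) / 0.3)

f_true = np.sin(2.0 * np.pi * grid)
B = op_kernel(grid[None, :], z[:, None]) * weights[None, :]
rng = np.random.default_rng(1)
y = B @ f_true + 0.01 * rng.normal(size=n)

M, G = pseudo_kernel_matrix(z, grid, weights, op_kernel, k)

# Least-squares loss: minimizer of (1/n)||y - M a||^2 + lam * a^T M a.
alpha = np.linalg.solve(M + n * lam * np.eye(n), y)
f_hat = alpha @ G                             # estimate on the grid, cf. (14)
```

The only problem-specific ingredient is the routine that applies $A$; once $M$
is available, the remaining computation has the same structure as for an
ordinary kernel matrix.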

Almost all articles on inverse regression problems assume the homoscedastic
regression model

$$Y = (Af_{A,P})(Z) + \varepsilon. \tag{17}$$

Instead of only considering such a specific model, we use a suitable loss
function $L$ and consider the risk

$$\mathcal{R}_{A,P}(f) := \int L\big(z, y, (Af)(z)\big)\, P(d(z, y))$$

for functions $f : \mathcal{X} \to \mathbb{R}$ (in the domain of $A$). Then, the
goal is to estimate a minimizer $f_{A,P}$ of this risk. In this way it is
possible, e.g., to investigate heteroscedastic inverse regression problems
such as

$$Y = (Af_{A,P})(Z) + s(Z)\varepsilon, \tag{18}$$

where $s$ is an unknown scale function or, even more generally,

$$Y = (Af_{A,P})(Z) + \varepsilon_Z,$$

where $(\omega, z) \mapsto \varepsilon_z(\omega)$ is a Markov kernel. In most
parts of the article, we do not make any specific assumption on the regression
model but consider minimizing risks. As a theoretical tool, we also need the
regularized risk

$$\mathcal{R}_{A,P,\lambda}(f) = \mathcal{R}_{A,P}(f) + \lambda \|f\|_H^2 = \int L\big(z, y, (Af)(z)\big)\, P(d(z, y)) + \lambda \|f\|_H^2$$

for $f \in H$ and $\lambda \in (0, \infty)$. This regularized risk is the
theoretical counterpart of the regularized empirical risk
$\mathcal{R}_{A, D_n, \lambda}$. The following theorem presents
a simple condition under which a unique minimizer of the regularized risk
exists. By choosing the empirical measure for $P$, the theorem also guarantees
existence of the estimates $f_{A, D_n, \lambda}$.

Theorem 2.4 Let Assumption 2.1 be fulfilled and

$$\int L(z, y, 0)\, P(d(z, y)) < \infty. \tag{19}$$

Then, for every $\lambda > 0$, there is a unique minimizer $f_{A,P,\lambda}$ of
$f \mapsto \mathcal{R}_{A,P,\lambda}(f)$ in $H$.



3 Consistency and Rate of Convergence 

Define i.i.d. random variables

$$(Z_i, Y_i) \sim P, \qquad i \in \mathbb{N}.$$

Then, the data set $D_n$ is a realization of the random vector

$$\mathbf{D}_n = \big( (Z_1, Y_1), \ldots, (Z_n, Y_n) \big).$$

The following theorem guarantees consistency of regularized kernel methods
for inverse regression problems under quite weak assumptions. In particular,
it also covers the multivariate case as $\mathcal{X}$ may be any compact subset
of $\mathbb{R}^d$, it does not require homoscedasticity or any signal plus noise
assumption, and we do not assume injectivity of the operator $A$ or any
properties of a singular system of $A$.

Theorem 3.1 Let Assumptions 2.1 and 2.2 be fulfilled, and let $A$ fulfill (8).
Assume that

$$\exists\, f^* \in H \ \text{ s.t. } \ \mathcal{R}_{A,P}(f^*) = \inf_{f} \mathcal{R}_{A,P}(f) =: \mathcal{R}^*_{A,P}. \tag{20}$$

Then, there is an $f_{A,P} \in H$ such that

$$\mathcal{R}_{A,P}(f_{A,P}) = \inf_{f} \mathcal{R}_{A,P}(f) \tag{21}$$

and, for every sequence $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ such
that $\lim_{n \to \infty} \lambda_n = 0$ and
$\lim_{n \to \infty} \lambda_n^{1 + p/2} \sqrt{n} = \infty$,

$$\| f_{A, \mathbf{D}_n, \lambda_n} - f_{A,P} \|_H \ \longrightarrow\ 0 \quad \text{in probability.} \tag{22}$$

Note that (22) also implies

$$\sup_{x \in \mathcal{X}} \big| f_{A, \mathbf{D}_n, \lambda_n}(x) - f_{A,P}(x) \big| \ \longrightarrow\ 0 \quad \text{in probability.}$$

As already mentioned above, Theorem 3.1 does not require any signal plus
noise assumption. Instead, it is only assumed that $H$ contains a minimizer
of the risk. In order to make this assumption more explicit in case of a
signal plus noise assumption, it is exemplified for special choices of the loss
function $L$ and the RKHS $H$. Consider the heteroscedastic model

$$Y = Af_0(Z) + s(Z)\varepsilon \tag{23}$$

where $s$ is an unknown scale function and $\mathbb{E}\varepsilon = 0$ in case
of the least-squares loss or $\mathrm{median}(\varepsilon) = 0$ in case of the
absolute deviation loss. (In the latter case, it is additionally assumed that
$0$ is the unique median of $\varepsilon$; error distributions which violate
this assumption are extremely unusual.) Then, Assumption (20) is fulfilled if
$f_0 \in H$. As we have not assumed injectivity of $A$ so far, model (23) is
not necessarily unique. It is possible that $f_0 \ne f_{A,P}$, but it follows
from (21) that $Af_{A,P} = Af_0$ $P_Z$-a.s. so that model (23) can
be rewritten as

$$Y = Af_{A,P}(Z) + s(Z)\varepsilon. \tag{24}$$

Obviously, distinguishing between models (23) and (24) is impossible. In
order to prove rates of convergence for
$f_{A, \mathbf{D}_n, \lambda_n} - f_{A,P}$ below, we will also
assume that $A$ is injective. Under this standard assumption, the model is
unique and $f_0 = f_{A,P}$.

In a parametric setting, it is typically assumed that $f_0$ is linear or a
polynomial. This assumption easily implies (20) if $k$ is the linear kernel or
a suitable polynomial kernel. In a nonparametric setting, the most common
kernel is the Gaussian RBF kernel. However, assuming that a function $f_0$
lies in the corresponding RKHS $H$ is a rather strong and inaccessible
assumption which can hardly be made explicit for a practitioner, even though
this RKHS is dense in $C(\mathcal{X})$. Therefore, it seems advisable to use
slightly different kernels in the nonparametric setting, namely Wendland
kernels. These are radial kernels of the form

$$k_{d,m}(x, x') = \phi_{d,m}(\|x - x'\|) \qquad \text{with} \qquad \phi_{d,m}(r) = \begin{cases} p_{d,m}(r), & 0 \le r \le 1 \\ 0, & r > 1 \end{cases}$$

where $p_{d,m}$ is a certain polynomial of degree
$\lfloor d/2 \rfloor + 3m + 1$. The polynomial is of minimal degree such that
$k_{d,m}$ is $m$-times continuously differentiable; see (Wendland, 2005,
Theorem 9.12 and 9.13). Though the shape of these kernels is very similar to
that of the Gaussian RBF kernel, Wendland kernels have two advantages: First,
they are compactly supported and therefore lead to sparse kernel matrices.
Second, there is a simple condition on $f_0$ which guarantees that $f_0$ is
contained in the corresponding RKHS $H_{d,m}$, i.e., that (20) is fulfilled.
If the dimension $d$ is odd, choose $m = (d+1)/2$ and $\gamma = d + 1$.
If $d$ is even, choose $m = d/2 + 1$ and $\gamma = d + 2$. Then, $f_0$ is in
$H_{d,m}$ if it is the restriction of a $\gamma$-times continuously
differentiable function on $\mathbb{R}^d$. This is a consequence of the fact
that the RKHS of $k_{d,m}$ is the Sobolev space
$H^{d/2 + m + 1/2}(\mathbb{R}^d)$, that $\gamma \ge d/2 + m + 1/2$, and that
$\mathcal{X}$ is bounded; see (Wendland, 2005, Theorem 10.35 and §10.7). For
the convenience of the reader, Table 1 contains the relevant Wendland
polynomials $p_{d,m}$ for dimensions $d \le 5$; these are calculated from
(Wendland, 2005, Cor. 9.15). Polynomials $p_{d,m}$ for higher dimensions can
be obtained recursively from (Wendland, 2005, Theorem 9.12).

    Dimension    Wendland polynomial
    d = 1        $p_{1,1}(r) = (1 - r)^3 (3r + 1)$
    d = 2        $p_{2,2}(r) = (1 - r)^6 (35r^2 + 18r + 3)$
    d = 3        $p_{3,2}(r) = (1 - r)^6 (35r^2 + 18r + 3)$
    d = 4        $p_{4,3}(r) = (1 - r)^9 (693r^3 + 477r^2 + 135r + 15)$
    d = 5        $p_{5,3}(r) = (1 - r)^9 (693r^3 + 477r^2 + 135r + 15)$

Table 1: Suitable Wendland polynomials $p_{d,m}$ for different dimensions $d$;
that is, $m = (d+1)/2$ if $d$ is odd, $m = d/2 + 1$ if $d$ is even.
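
The Wendland polynomials of Table 1 and the sparsity of the resulting kernel
matrices can be illustrated with a short Python sketch. The function name
`wendland_kernel`, the `scale` argument, and the random test points are
assumptions made here for illustration; only the polynomials themselves are
taken from Table 1.

```python
import numpy as np

# Polynomials p_{d,m} from Table 1, keyed by the dimension d.
WENDLAND_POLYS = {
    1: lambda r: (1 - r) ** 3 * (3 * r + 1),
    2: lambda r: (1 - r) ** 6 * (35 * r ** 2 + 18 * r + 3),
    3: lambda r: (1 - r) ** 6 * (35 * r ** 2 + 18 * r + 3),
    4: lambda r: (1 - r) ** 9 * (693 * r ** 3 + 477 * r ** 2 + 135 * r + 15),
    5: lambda r: (1 - r) ** 9 * (693 * r ** 3 + 477 * r ** 2 + 135 * r + 15),
}

def wendland_kernel(x, xp, scale=1.0):
    """Kernel matrix k(x_i, x'_j) = phi_{d,m}(||x_i - x'_j|| / scale).

    x and xp are arrays of shape (n, d) and (n', d) with d in {1, ..., 5};
    `scale` is a hypothetical rescaling of the support radius.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    xp = np.asarray(xp, dtype=float).reshape(len(xp), -1)
    d = x.shape[1]
    r = np.linalg.norm(x[:, None, :] - xp[None, :, :], axis=2) / scale
    # phi(r) = p(r) on [0, 1) and 0 for r >= 1 (p vanishes at r = 1 anyway).
    return np.where(r < 1.0, WENDLAND_POLYS[d](np.minimum(r, 1.0)), 0.0)

# Illustration with random points in [0, 1]^2.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
K = wendland_kernel(X, X, scale=0.3)
sparsity = float(np.mean(K == 0.0))           # fraction of exact zeros
min_eig = float(np.linalg.eigvalsh(0.5 * (K + K.T)).min())
```

In this toy setting a large share of the entries of `K` is exactly zero, so
sparse matrix storage could be used for larger sample sizes.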



For the risk of the estimator, we obtain a rate of convergence, even under the
quite weak assumptions of Theorem 3.1. For this rate, neither an assumption
on the singular system of $A$ nor a signal plus noise assumption such as (17)
or (18) is needed.



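Theorem 3.2 Let Assumptions 2.1 and 2.2 be fulfilled, let $A$ fulfill (8), and
assume (20). Let $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ and
$(a_n)_{n \in \mathbb{N}} \subset (0, \infty)$ be sequences such
that $\lim_{n \to \infty} \lambda_n = 0$, $\lim_{n \to \infty} a_n = \infty$,

$$\lim_{n \to \infty} \frac{a_n}{\lambda_n^{1 + p/2} \sqrt{n}} = 0, \qquad \text{and} \qquad \lim_{n \to \infty} \lambda_n a_n = 0. \tag{25}$$

Then,

$$a_n \big( \mathcal{R}_{A,P}(f_{A, \mathbf{D}_n, \lambda_n}) - \mathcal{R}^*_{A,P} \big) \ \longrightarrow\ 0 \quad \text{in probability.} \tag{26}$$

In particular, if $\lambda_n = \gamma n^{-1/(4+p)}$ for all $n \in \mathbb{N}$
and some constant $\gamma \in (0, \infty)$, then every sequence
$(a_n)_{n \in \mathbb{N}} \subset (0, \infty)$ such that

$$\lim_{n \to \infty} a_n n^{-1/(4+p)} = 0$$

fulfills condition (25).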



In the rest of this section, we are concerned with rates of convergence for
$f_{A, \mathbf{D}_n, \lambda_n} - f_{A,P}$. While Theorems 2.3, 2.4, 3.1, and
3.2 are valid for all of the commonly used loss functions and do not require
any involved assumptions on the distribution $P$ and the operator $A$,
obtaining rates of convergence for
$f_{A, \mathbf{D}_n, \lambda_n} - f_{A,P}$ is certainly only possible under much
more restrictive and involved assumptions. Therefore, we need some
preparations before we are able to state such rates of convergence in
Theorem 3.3 below.



Loss function L. 

Nearly all articles on inverse (regression) problems which (at least implicitly)
employ a loss function are restricted to the least-squares loss. (A notable
exception is Krebs (2011), which uses the $\varepsilon$-insensitive loss common
in machine learning.) While we did not have to specify a special loss function
so far, we now fix a specific loss function, namely the absolute deviation
loss. That is, our loss function is

$$L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty), \qquad (z, y, t) \mapsto |y - t| \tag{27}$$

in the following. On the one hand, this choice of the loss function is
motivated by the fact that the absolute deviation loss typically leads to better
robustness properties than the least-squares loss. In particular, it is shown
in Hable and Christmann (2011), Christmann and Steinwart (2007), and
Christmann et al. (2009) that ordinary regularized kernel methods based on
the absolute deviation loss enjoy a qualitative robustness property and have
a bounded influence function and a bounded maxbias. On the other hand,
the absolute deviation loss together with suitable assumptions on the model
guarantees a bound of the form

$$\| g - g^* \|_{L_q(P_Z)} \le \big( \mathcal{R}_{\mathrm{id},P}(g) - \mathcal{R}_{\mathrm{id},P}(g^*) \big)^r$$

where
$\mathcal{R}_{\mathrm{id},P}(g^*) = \int L(z, y, g^*(z))\, P(d(z, y)) = \inf_{g} \int L(z, y, g(z))\, P(d(z, y))$
and $q, r \in (0, \infty)$. However, by adapting the model assumptions, such
bounds can also be obtained for other loss functions; see (Steinwart and
Christmann, 2008, §3.9).



Inverse regression model. 

In the following, we assume the heteroscedastic regression model

$$Y = (Af_{A,P})(Z) + s(Z)\varepsilon, \tag{28}$$

where $s$ is an unknown scale function such that

$$\text{there are constants } \underline{c}_s, \overline{c}_s \in (0, \infty) \text{ with } \underline{c}_s \le s \le \overline{c}_s, \tag{29}$$

the random error $\varepsilon$ is independent of $Z$, has

$$\mathrm{median}(\varepsilon) = 0, \tag{30}$$

and

$$\text{the distribution of } \varepsilon \text{ has a Lebesgue density } h \text{ such that} \tag{31}$$

$$\exists\, a_h, c_h \in (0, \infty) \ \text{ s.t. } \ \forall\, y \in (-a_h, a_h) : \ h(y) \ge c_h. \tag{32}$$

If the distribution of the error $\varepsilon$ has a Lebesgue density, then
Assumption (32) is very natural and, e.g., fulfilled for all unimodal error
distributions. If $h$ is continuous, it is sufficient that $h(0) \ne 0$.

Operator A. 

In order to obtain rates of convergence for
$f_{A, \mathbf{D}_n, \lambda_n} - f_{A,P}$, we will also
need the standard assumption that

$$A : H \to C_b(\mathcal{Z}) \ \text{ is injective.} \tag{33}$$

Let $\mathcal{Z}_0$ be the support of $P_Z$ and let $P_{Z_0}$ denote the
restriction of $P_Z$ to $\mathcal{Z}_0$. Then, the natural embedding
$\iota_0 : C_b(\mathcal{Z}) \to L_2(P_{Z_0})$ defines a continuous
linear operator $A_0 := \iota_0 \circ A$ and it is easy to see that (33)
implies that

$$A_0 : H \to L_2(P_{Z_0}) \ \text{ is injective.} \tag{34}$$

Furthermore, it follows from Assumption (8) and Prop. 6.1 that $A_0$ is a
compact operator. In this case, $A_0$ has a singular system
$(\sigma_j; v_j, u_j)_{j \in \mathbb{N}}$; see, e.g., (Engl et al., 1996, §2.2).
That is, $\sigma_j^2$, $j \in \mathbb{N}$, are the non-zero eigenvalues of the
self-adjoint operator $A_0^* A_0$ such that
$\sigma_1 \ge \sigma_2 \ge \cdots > 0$. The set
$\{v_j \,|\, j \in \mathbb{N}\}$ is a corresponding complete orthonormal system
of $H$; completeness follows from injectivity of $A_0$ and (36) below. Finally,
the elements

$$u_j := \frac{A_0 v_j}{\| A_0 v_j \|_{L_2(P_{Z_0})}}$$

form a complete orthonormal system of the closure of
$\{A_0 f \,|\, f \in H\}$ in $L_2(P_{Z_0})$. We have

$$A_0 v_j = \sigma_j u_j, \qquad A_0^* u_j = \sigma_j v_j \qquad \forall\, j \in \mathbb{N} \tag{35}$$

$$A_0 f = \sum_{j=1}^{\infty} \sigma_j \langle f, v_j \rangle_H\, u_j \qquad \forall\, f \in H \tag{36}$$

$$A_0^* g = \sum_{j=1}^{\infty} \sigma_j \langle g, u_j \rangle_{L_2(P_{Z_0})}\, v_j \qquad \forall\, g \in L_2(P_{Z_0}). \tag{37}$$

Now, we can state the theorem on rates of convergence for
$f_{A, \mathbf{D}_n, \lambda_n} - f_{A,P}$.
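Theorem 3.3 Let Assumption 2.1 be fulfilled and assume that the marginal
distribution $P_Y$ has a finite first moment, i.e.,
$\mathbb{E}|Y| = \int |y|\, P(d(z, y)) < \infty$.
Let $L$ be the absolute deviation loss (27) and assume the heteroscedastic
model given by (28)-(32). Let $A$ fulfill (8) and (33).
Let $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ and
$(a_n)_{n \in \mathbb{N}} \subset (0, \infty)$ be sequences with
$\lim_{n \to \infty} \lambda_n = 0$, $\lim_{n \to \infty} a_n = \infty$, and

$$\lim_{n \to \infty} \frac{\sqrt{a_n}}{\lambda_n \sqrt{n}} = 0. \tag{38}$$

Then, the following assertions are valid:

(a) If

$$\sum_{j=1}^{\infty} \frac{\langle f_{A,P}, v_j \rangle_H^2}{\sigma_j^2} < \infty \tag{39}$$

and $\lim_{n \to \infty} a_n \sqrt{\lambda_n} = 0$, then

$$a_n \| f_{A, \mathbf{D}_n, \lambda_n} - f_{A,P} \|_H^2 \ \longrightarrow\ 0 \quad \text{in probability.} \tag{40}$$

(b) Fix any $x \in \mathcal{X}$. If

$$\sum_{j=1}^{\infty} \frac{\big( v_j(x) \big)^2}{\sigma_j^2} < \infty \tag{41}$$

and $\lim_{n \to \infty} a_n \lambda_n = 0$, then

$$a_n \big( f_{A, \mathbf{D}_n, \lambda_n}(x) - f_{A,P}(x) \big)^2 \ \longrightarrow\ 0 \quad \text{in probability.} \tag{42}$$

For example, in part (a), let $\lambda_n = \gamma n^{-2/5}$ for all
$n \in \mathbb{N}$ and some constant $\gamma \in (0, \infty)$; then all
conditions on $(a_n)_{n \in \mathbb{N}} \subset (0, \infty)$ are fulfilled if
$\lim_{n \to \infty} a_n = \infty$ and

$$\lim_{n \to \infty} a_n n^{-1/5} = 0.$$

In part (b), let $\lambda_n = \gamma n^{-1/3}$ for all $n \in \mathbb{N}$ and
some constant $\gamma \in (0, \infty)$; then all conditions on
$(a_n)_{n \in \mathbb{N}} \subset (0, \infty)$ are fulfilled if
$\lim_{n \to \infty} a_n = \infty$ and

$$\lim_{n \to \infty} a_n n^{-1/3} = 0.$$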

Assumptions such as (39) in part (a) are often called smoothness assumptions
on $f_{A,P}$ and are common in order to obtain rates of convergence.
Assumption (41) in part (b) differs as it does not involve $f_{A,P}$ but is a
condition on a fixed $x$. The result of part (b) can be interpreted in the
following way: if the inverse regression problem is only moderately ill-posed
in some area, then any $f_{A,P}$ can be estimated with a certain rate of
convergence in that area. Of course, the convergence in (40) and (42) could
also be reformulated without the exponent 2. However, this formulation makes
it possible to easily compare the results with other rates of convergence which
most often apply to

$$\mathbb{E} \| f_{A, \mathbf{D}_n, \lambda_n} - f_{A,P} \|^2$$

where $\|\cdot\|$ denotes the respectively considered norm.



4 Simulation 

In order to illustrate the use of regularized kernel methods for inverse
regression problems, this section contains simulations in a typical example of
an ill-posed problem, namely backward heat conduction. In a statistical
setting, this example has also been considered, e.g., in Mair and Ruymgaart
(1996) and Bissantz and Holzmann (2013). According to, e.g., (Kress, 1999,
Example 15.3), we are faced with the heat equation

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}$$

where

$$u : [0, 1] \times [0, T] \to \mathbb{R}, \qquad (x, t) \mapsto u(x, t)$$

denotes the temperature at any spatial point $x \in [0, 1]$ and time
$t \in [0, T]$. The boundary conditions are $u(0, t) = u(1, t) = 0$ for every
$t \in [0, T]$ and the goal is to recover the initial conditions

$$f(x) := u(x, 0), \qquad x \in [0, 1],$$

at time $t = 0$ from (noisy) observations at time $t = T$. Let $A$ denote
the operator which maps the initial state $f$ to the final temperature curve
$g = u(\cdot, T)$ at time $T$. Then,

$$(Af)(z) = u(z, T) = \sum_{j=1}^{\infty} \exp(-j^2 \pi^2 T) \int f(x)\, v_j(x)\, \lambda(dx)\ v_j(z) \tag{43}$$

with $v_j(z) = \sqrt{2} \sin(j \pi z)$ for every $z \in [0, 1]$. That is,
$\mathcal{X} = \mathcal{Z} = [0, 1]$ and $(v_j)_{j \in \mathbb{N}}$ is a
complete orthonormal system of $L_2([0, 1])$. As a mapping from $L_2([0, 1])$
to $L_2([0, 1])$, the operator $A$ is self-adjoint with eigenfunctions $v_j$
and eigenvalues $\sigma_j = \exp(-j^2 \pi^2 T)$. We use this standard example
so that the method can be compared to the spectral cut-off estimator suggested
by Mair and Ruymgaart (1996).
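
The series (43) can be evaluated numerically by truncating the sum and
replacing the integral by a quadrature rule. The following minimal Python
sketch does this; the function name `heat_operator`, the truncation level
`n_terms`, and the midpoint rule are assumptions made here for illustration
and are not part of the article. As a sanity check, the sketch applies the
operator to the eigenfunction $v_1$, for which $Av_1 = \exp(-\pi^2 T)\, v_1$.

```python
import numpy as np

def heat_operator(f_vals, grid, weights, z, T=0.01, n_terms=30):
    """Approximate (Af)(z) from (43) for f given by its values on `grid`.

    The infinite series is truncated after `n_terms` terms (an assumption of
    this sketch); the integral is replaced by the quadrature (grid, weights).
    """
    j = np.arange(1, n_terms + 1)
    V_grid = np.sqrt(2.0) * np.sin(np.pi * np.outer(j, grid))   # v_j(x_l)
    V_z = np.sqrt(2.0) * np.sin(np.pi * np.outer(j, z))         # v_j(z_i)
    coeffs = V_grid @ (weights * f_vals)       # approx. of int f v_j d lambda
    sigma = np.exp(-(j ** 2) * np.pi ** 2 * T)                  # eigenvalues
    return (sigma * coeffs) @ V_z

# Midpoint rule on [0, 1] and a few evaluation points.
m, T = 2000, 0.01
grid = (np.arange(m) + 0.5) / m
weights = np.full(m, 1.0 / m)
z = np.linspace(0.0, 1.0, 101)

v1_grid = np.sqrt(2.0) * np.sin(np.pi * grid)
Av1 = heat_operator(v1_grid, grid, weights, z, T=T)
expected = np.exp(-np.pi ** 2 * T) * np.sqrt(2.0) * np.sin(np.pi * z)
```

Because the eigenvalues decay like $\exp(-j^2 \pi^2 T)$, the truncated terms
are negligible for the forward map; the same decay is what makes the inverse
problem severely ill-posed.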



The model. We simulate data

$$y_i = (Af_0)(z_i) + s_\delta(z_i)\, \varepsilon_i, \qquad i \in \{1, \ldots, n\}, \tag{44}$$

with

$$f_0(x) = -10x(x - 1) \sin(4 \pi x) \qquad \text{and} \qquad s_\delta(z) = 8 \delta z (z - 1) \tag{45}$$

for $T = 0.01$ and different values of the scale factor $\delta$. The
regression function $A(f_0)$ and the function $f_0$ we want to recover are
shown in Figure 1. We use fixed equidistant design points $z_i = (i - 1)/n$ on
$[0, 1]$ because most theoretical results on spectral cut-off estimators are
for such design points. This clearly favors the spectral cut-off estimator,
also because this estimator relies heavily on the knowledge that the $Z_i$ are
uniform, whereas the regularized kernel method neither needs nor uses such
information. The errors $\varepsilon_i$ are (independently) sampled from the
standard normal distribution. We consider the three scale factors
$\delta \in \{0.25, 0.5, 1\}$, which result in errors of similar sizes as in
(Bissantz and Holzmann, 2013, §3.2). The sample sizes
are $n \in \{100, 250, 500, 1000\}$.

Figure 1: The regression function $A(f_0)$ (solid line) and the function $f_0$
(dotted line) which has to be recovered.

The estimators. As estimators, we use a regularized kernel method (RKM) and a spectral cut-off estimator (SCE). In case of the RKM, we choose the absolute deviation loss function and the rescaled Wendland kernel

$$k(x, x') = \phi_{1,1}\bigl(|x - x'|/0.3\bigr)$$

with $\phi_{1,1}(r) = (1-r)_+^3\,(3r+1)$ for every $r \in [0,\infty)$; see (Wendland, 2005, Table 9.1). The scaling 0.3 is approximately equal to the median of the values $|x_i - x_j|$, $i, j \in \{1, \dots, n\}$. This heuristic is the analogue of a heuristic used in Caputo et al. (2002) and (Karatzoglou et al., 2004, p. 9) in order to choose the scaling factor in case of the Gaussian RBF kernel. The regularization parameter is equal to $\lambda = a\, n^{-0.45}$ where $a$ is selected via a 5-fold cross validation among the values

$$10^{-4},\ 5\cdot 10^{-4},\ 10^{-3},\ 5\cdot 10^{-3},\ 10^{-2},\ 5\cdot 10^{-2},\ 10^{-1}$$

in each run of the simulation. According to (Mair and Ruymgaart, 1996, § 7.2), the spectral cut-off estimator (SCE) is given by

$$f_{n,J}(x) = \sum_{j=1}^{J} \frac{b_{j,n}}{a_j}\, v_j(x) \quad\text{with}\quad b_{j,n} = \frac{1}{n}\sum_{i=1}^{n} v_j(z_i)\, y_i$$

where the number $J \in \mathbb{N}$ of basis functions is a regularization parameter which is selected via a 5-fold cross validation among the values

2, 3, 4, 5, ..., 10

in each run of the simulation.

            n = 100        n = 250        n = 500        n = 1000
  δ        RKM    SCE     RKM    SCE     RKM    SCE     RKM    SCE
  0.25     0.26   0.32    0.18   0.32    0.14   0.12    0.12   0.09
  0.5      0.40   0.32    0.32   0.32    0.24   0.32    0.17   0.16
  1        0.46   0.33    0.42   0.32    0.38   0.32    0.30   0.32

Table 2: The medians of the results $b_1, \dots, b_{1000}$ for both estimators in each situation.
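As an illustration of the spectral cut-off formula, the following Python sketch computes the coefficients $b_{j,n}$ and evaluates $f_{n,J}$. It rests on assumptions that are not stated in this section: the basis is taken to be $v_j(x) = \sqrt{2}\sin(j\pi x)$ and the singular values $a_j = \exp(-j^2\pi^2 T)$, which matches the exponential coefficients appearing in the formulas for $M$ below but should be checked against the definition of $A$. Both are passed in as replaceable arguments.

```python
import math


def sine_basis(j):
    """Assumed basis function v_j(x) = sqrt(2) * sin(j * pi * x)."""
    return lambda x: math.sqrt(2.0) * math.sin(j * math.pi * x)


def heat_singular_value(j, T=0.01):
    """Assumed singular value a_j = exp(-j^2 * pi^2 * T)."""
    return math.exp(-(j ** 2) * math.pi ** 2 * T)


def spectral_cutoff(z, y, J, basis=sine_basis, singular_value=heat_singular_value):
    """Return the function x -> f_{n,J}(x) = sum_{j<=J} (b_{j,n} / a_j) * v_j(x)."""
    n = len(z)
    coeffs = []
    for j in range(1, J + 1):
        v_j = basis(j)
        b_jn = sum(v_j(zi) * yi for zi, yi in zip(z, y)) / n  # empirical coefficient
        coeffs.append(b_jn / singular_value(j))               # division by a_j

    def f_hat(x):
        return sum(c * basis(j)(x) for j, c in enumerate(coeffs, start=1))

    return f_hat
```

The division by $a_j$ is what makes the estimator sensitive to the choice of $J$: for larger $j$ the assumed singular values become very small, so noise in $b_{j,n}$ is strongly amplified.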

Performance results. The simulation consists of 1000 runs. In each run $r \in \{1, \dots, 1000\}$, the quality of the estimate is measured by

$$b_r = \int_{[0,1]} \bigl| f_0(x) - f_n^{(r)}(x) \bigr| \,\lambda(dx).$$

The medians and the boxplots (the ends of the whiskers represent the 10 and the 90 percent quantiles respectively) of the values $b_1, \dots, b_{1000}$ for both estimators are shown in Table 2 and Figure 2, respectively, in each situation.
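The error measure and the reported summaries could be computed along the following lines. This is a sketch under assumptions: the integral is approximated by a midpoint rule on a regular grid (the text does not say how it was evaluated), and the quantile convention is the default of Python's `statistics.quantiles`, which may differ from the one used for the tables.

```python
import statistics


def l1_error(f_true, f_hat, grid_size=1000):
    """Midpoint-rule approximation of the integral of |f_true - f_hat| over [0, 1]."""
    h = 1.0 / grid_size
    return h * sum(
        abs(f_true((k + 0.5) * h) - f_hat((k + 0.5) * h)) for k in range(grid_size)
    )


def summarize(errors):
    """Median together with the 10 and 90 percent quantiles of the run-wise errors."""
    deciles = statistics.quantiles(errors, n=10)  # nine cut points
    return {
        "median": statistics.median(errors),
        "q10": deciles[0],
        "q90": deciles[8],
    }
```

Applied to the 1000 values of one estimator in one situation, `summarize` gives the kind of numbers reported in the tables (median, 90 percent quantile) and the whisker ends of the boxplots.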

In most cases, the performance of both estimators is similar. However, a closer look at the boxplots reveals that, in case of the SCE, the 25, 50, and 75 percent quantiles are very close to each other but the 90 percent quantile is far off for some values of $n$ and $\delta$. The reason for this is that the estimate is very stable for small values of $J$ such as $J = 4$ or $J = 5$ and the 5-fold cross validation most often chooses these values in some situations. However, the SCE turns out to be very sensitive to the choice of $J$ in our simulations. If larger values such as $J \in \{6, 7, 8\}$ are selected, then there is a considerable danger that the SCE breaks down. Therefore, selecting the right value $J$ is crucial and automatic data-driven methods such as a $k$-fold cross validation might not be sufficient. Table 3 shows the 90 percent quantiles of the values $b_1, \dots, b_{1000}$ for both estimators in each situation. From these quantiles, it can be seen that the SCE frequently breaks down for larger values of $n$ and $\delta$. Therefore, Table 2 does not show the mean but the median of the values $b_1, \dots, b_{1000}$ because the mean is corrupted by large outliers in case of the SCE. As this does not happen in case of the RKM, using the median favors the SCE.

Figure 2: The boxplots of the results $b_1, \dots, b_{1000}$ for both estimators in each situation; the ends of the whiskers represent the 10 and the 90 percent quantiles respectively.

            n = 100        n = 250        n = 500        n = 1000
  δ        RKM    SCE     RKM    SCE     RKM    SCE     RKM    SCE
  0.25     0.53   0.32    0.39   0.32    0.32   0.32    0.25   0.31
  0.5      0.74   0.44    0.57   0.65    0.52   0.71    0.37   1.45
  1        1.01   5.33    0.74   4.89    0.69   2.68    0.57   5.59

Table 3: The 90 percent quantiles of the results $b_1, \dots, b_{1000}$ for both estimators in each situation.

Details on computations. The computation of the spectral cut-off estimator is simple and extremely fast in this case because the spectral decomposition is already known here. As mentioned below Theorem 2.3, the computation of regularized kernel methods for inverse problems can be done by using standard software for computing ordinary regularized kernel methods. The only additional thing one has to do is to calculate the pseudo kernel matrix $M$ as defined in (15). Here, we have

$$M_{ij} = \sum_{q=1}^{\infty}\sum_{s=1}^{\infty} \exp\bigl(-(q^2+s^2)\pi^2 T\bigr)\, I_{q,s}\, v_q(z_j)\, v_s(z_i)$$

for

$$I_{q,s} = \int_{[0,1]^2} k(x_1, x_2)\, v_q(x_1)\, v_s(x_2)\, \lambda^2(d(x_1, x_2)).$$

As the exponential coefficients decrease extremely fast, the infinite double series can be approximated very well by only calculating a few terms, e.g., up to $q, s = 30$ is more than enough. The double integrals $I_{q,s}$ are approximated by a Monte-Carlo simulation in our simulated example. Then, the coefficients $\alpha_1, \dots, \alpha_n$ are calculated by the R-package "kernlab" (a standard software for calculating ordinary regularized kernel methods), in the following way:

    model <- kqr(as.kernelMatrix(M), y, tau = 0.5, C = cost)
    alpha <- alpha(model)

where y denotes the vector of the observed values $y_i$, tau=0.5 corresponds to using the absolute deviation loss function, and cost is the cost regularization parameter which is equal to $1/(2n\lambda)$. It is worth mentioning that it is sufficient to calculate $M$ once and to reuse this matrix in every step of the $k$-fold cross validation for tuning cost. Finally, values of the estimate $f_n$ can be calculated according to (14); here, we have

$$f_n(x) = \sum_{q=1}^{\infty} \exp(-q^2\pi^2 T)\, I_q(x) \sum_{i=1}^{n} \alpha_i\, v_q(z_i)$$

for

$$I_q(x) = \int_{[0,1]} k(x_1, x)\, v_q(x_1)\, \lambda(dx_1).$$

Again, the extremely fast decay of the exponential coefficients guarantees that the series can be approximated very well by only calculating a few terms.
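The truncated double series for $M$ can be sketched as follows. Several choices here are assumptions made for illustration only: the eigenfunctions are taken to be $v_q(x) = \sqrt{2}\sin(q\pi x)$, and the double integrals $I_{q,s}$ are approximated by a deterministic midpoint rule rather than by the Monte-Carlo simulation used in the text. The helper `cost_from_lambda` merely restates the relation cost $= 1/(2n\lambda)$.

```python
import math


def wendland_kernel(x, y, scale=0.3):
    """Rescaled Wendland kernel phi_{1,1}(|x - y| / scale), phi_{1,1}(r) = (1 - r)_+^3 (3r + 1)."""
    r = abs(x - y) / scale
    return max(1.0 - r, 0.0) ** 3 * (3.0 * r + 1.0)


def v(q, x):
    """Assumed eigenfunction v_q(x) = sqrt(2) * sin(q * pi * x)."""
    return math.sqrt(2.0) * math.sin(q * math.pi * x)


def pseudo_kernel_matrix(z, T=0.01, Q=30, grid_size=200):
    """Approximate M_ij by truncating both series at Q and using midpoint quadrature."""
    h = 1.0 / grid_size
    nodes = [(k + 0.5) * h for k in range(grid_size)]
    K = [[wendland_kernel(xa, xb) for xb in nodes] for xa in nodes]
    basis = [[v(q, x) for x in nodes] for q in range(1, Q + 1)]
    # kv[s][a] approximates the integral of k(x_a, x2) * v_s(x2) over x2
    kv = [[h * sum(Ka[b] * bs[b] for b in range(grid_size)) for Ka in K] for bs in basis]
    # I[q][s] approximates the double integral I_{q,s}
    I = [[h * sum(bq[a] * kvs[a] for a in range(grid_size)) for kvs in kv] for bq in basis]
    decay = [math.exp(-(q ** 2) * math.pi ** 2 * T) for q in range(1, Q + 1)]
    n = len(z)
    vz = [[v(q, zi) for q in range(1, Q + 1)] for zi in z]
    return [
        [
            sum(
                decay[q] * decay[s] * I[q][s] * vz[j][q] * vz[i][s]
                for q in range(Q)
                for s in range(Q)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


def cost_from_lambda(n, lam):
    """Cost parameter 1 / (2 * n * lambda) handed to the kernel software."""
    return 1.0 / (2.0 * n * lam)
```

Because $M$ does not depend on $\lambda$, it can be computed once and reused for every candidate value of the cost parameter during cross validation, as noted in the text.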



5 Conclusions 



Regularized kernel methods constitute a broad and flexible class of methods which originate from machine learning and are common in nonparametric classification and regression problems today. In this article, we investigated the use of regularized kernel methods for inverse problems in a unifying way. In addition to consistency results under very weak assumptions, we also obtained a rate of convergence under a typical smoothness assumption on the target function. Though such a rate of convergence is of a purely theoretical nature and is not interesting on its own for real data analysis, it can play an important role in developing methods for statistical inference (such as asymptotic confidence sets) based on undersmoothing. However, statistical inference is still at an early stage even in case of regularized kernel methods for ordinary regression problems as well as in case of inverse regression problems with any other estimation method. First steps on statistical inference for regularized kernel methods are done in De Brabanter et al. (2011), Hable (2012b), and Hable (2012a). In case of inverse regression problems, Bissantz and Holzmann (2008) and Bissantz and Birke (2009) are concerned with asymptotic confidence sets for spectral cut-off estimators. Accordingly, enabling statistical inference for inverse regression problems with regularized kernel methods is a matter of challenging future research. Using regularized kernel methods in real data analysis of inverse regression problems seems to be promising as they have nice computational properties and it is possible to resort to already existing, well developed software implementations of ordinary regularized kernel methods.
6 Appendix



6.1 Additional Results 

This subsection contains some additional results on regularized kernel methods for inverse problems. On the one hand, these results are needed in the proofs of the main results; on the other hand, they are also of interest in their own right, in particular, as some of them are the counterparts of some of the main tools for ordinary regularized kernel methods (i.e. $A = \mathrm{id}$).
The first proposition shows that, in our setting, compactness of $A$ (in a rather weak sense) comes for free by assuming (pi).



Proposition 6.1 Let Assumption 2.1 be fulfilled, and assume that A fulfills 
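|#p. Then, $A : H \to C_b(\mathcal{Z})$ is a compact operator; that is,

$$(f_n)_{n\in\mathbb{N}} \subset H,\quad f_n \rightharpoonup f_0 \quad\Longrightarrow\quad \|A(f_n) - A(f_0)\|_\infty \longrightarrow 0 \tag{46}$$

where $\rightharpoonup$ denotes weak convergence in the Hilbert space.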

The next theorem is a general representer theorem. In case of ordinary regularized kernel methods, general representer theorems are the main tool for deriving theoretical results on consistency, rates of convergence, asymptotic normality, and robustness. The proof of the following theorem for the case of inverse problems is similar to the proof of the general representer theorem in the ordinary case (see, e.g., Steinwart and Christmann, 2008, Theorem 5.8 and Theorem 5.9) even though the assumptions and the result considerably differ.

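Theorem 6.2 (General Representer Theorem)
Let Assumptions 2.1 and 2.2 be fulfilled, and fix any $\lambda > 0$. Then, there is an $h_{P,\lambda} \in L_2(P)$ with the following properties:

(a) For every $(z,y) \in \mathcal{Z}\times\mathcal{Y}$,

$$|h_{P,\lambda}(z,y)| \le b_0'(y) + b_1' \Bigl( \|A\| \sqrt{\tfrac{1}{\lambda}\textstyle\int b\,dP} + 1 \Bigr)^{p}, \tag{47}$$

$$h_{P,\lambda}(z,y) \in \partial L\bigl(z, y, (A f_{A,P,\lambda})(z)\bigr) \tag{48}$$

where $\partial L(z,y,\cdot)$ denotes the subdifferential of the convex function $t \mapsto L(z,y,t)$.

(b) If $A_P^* : L_2(P) \to H$ denotes the adjoint of the continuous linear map $A_P : H \to L_2(P)$ given by $(A_P f)(z,y) = (Af)(z)$ $\forall f \in H$, $z \in \mathcal{Z}$, $y \in \mathcal{Y}$, then

$$f_{A,P,\lambda} = -\frac{1}{2\lambda}\, A_P^*(h_{P,\lambda}) \tag{49}$$

and

$$f_{A,P,\lambda}(x) = -\frac{1}{2\lambda} \int A(\Phi(x))(z)\, h_{P,\lambda}(z,y)\, P(d(z,y)) \qquad \forall x \in \mathcal{X}. \tag{50}$$

(c) If $P_1$ is a probability measure on $\mathcal{Z}\times\mathcal{Y}$ such that $\int b\,dP_1 < \infty$ and $\int b_0'^{\,2}\,dP_1 < \infty$, then

$$\|f_{A,P_1,\lambda} - f_{A,P,\lambda}\|_H \le \frac{1}{\lambda} \sup_{f\in\mathcal{F}} \Bigl| \int h_{P,\lambda}\, Af \,dP_1 - \int h_{P,\lambda}\, Af \,dP \Bigr| \tag{51}$$

where $\mathcal{F} = \{ f \in H \mid \|f\|_H \le 1 \}$.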



The following theorem yields a rate of convergence for the stochastic part, that is, the difference between the empirical estimate $f_{A,D_n,\lambda_n}$ and its theoretical counterpart $f_{A,P,\lambda_n}$. In case of ordinary regularized kernel methods (i.e. $A = \mathrm{id}$), a corresponding result can simply be proven by the representer theorem and Hoeffding's inequality for Hilbert spaces. However, in our case of inverse problems, the situation is much more complicated because working with property (51) of the representer theorem for inverse problems is more troublesome than working with the corresponding property in case of $A = \mathrm{id}$. Accordingly, we cannot apply Hoeffding's inequality directly but use Donsker theory for empirical processes instead.



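Theorem 6.3 Let Assumptions 2.1 and 2.2 be fulfilled. Let $(\lambda_n)_{n\in\mathbb{N}} \subset (0,\infty)$ and $(a_n)_{n\in\mathbb{N}} \subset (0,\infty)$ be sequences such that

$$\lim_{n\to\infty} \lambda_n = 0, \qquad \lim_{n\to\infty} a_n = \infty, \qquad\text{and}\qquad \lim_{n\to\infty} \frac{a_n}{\lambda_n^{1+p/2}\sqrt{n}} = 0. \tag{52}$$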



Let be fulfilled for A. Then, 
$$a_n\, \| f_{A,D_n,\lambda_n} - f_{A,P,\lambda_n} \|_H \longrightarrow 0$$

in probability.



In the following, we are concerned with the deterministic part. Prop. 6.4 states that the risk of $f_{A,P,\lambda_n}$ converges to the infimal risk; Theorem 6.5 yields that $f_{A,P,\lambda_n}$ even converges in $H$-norm to a minimizer of the risk, provided that a minimizer exists in $H$.



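Proposition 6.4 Let Assumption 2.1 and (19) be fulfilled, and let $(\lambda_n)_{n\in\mathbb{N}} \subset (0,\infty)$ be a sequence such that $\lim_{n\to\infty} \lambda_n = 0$. Then,

$$\lim_{n\to\infty} \mathcal{R}_{A,P}(f_{A,P,\lambda_n}) = \inf_{f\in H} \mathcal{R}_{A,P}(f).$$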



Theorem 6.5 Let Assumptions 2.1 and 2.2 be fulfilled, and let A fulfill |5p. 
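Assume that

$$\exists\, f^* \in H \quad\text{s.t.}\quad \mathcal{R}_{A,P}(f^*) = \inf_{f\in H} \mathcal{R}_{A,P}(f). \tag{53}$$

Then, there is a unique $f_{A,P} \in H$ with the following two properties:

$$\mathcal{R}_{A,P}(f_{A,P}) = \inf_{f\in H} \mathcal{R}_{A,P}(f), \tag{54}$$

$$\|f^*\|_H > \|f_{A,P}\|_H \quad\text{or}\quad f^* = f_{A,P}. \tag{55}$$

Furthermore, for every sequence $(\lambda_n)_{n\in\mathbb{N}} \subset (0,\infty)$ such that $\lim_{n\to\infty} \lambda_n = 0$, it follows that

$$\lim_{n\to\infty} \| f_{A,P,\lambda_n} - f_{A,P} \|_H = 0. \tag{56}$$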



The following Lemma 6.6 provides us with a bound of the form (28). Such a bound could also easily be adopted from the general results in (Steinwart and Christmann, 2008, §3.9). However, in our special situation, it is possible to obtain a tighter bound which enables the proof of better rates of convergence in Theorem 3.3.



Lemma 6.6 Let $P$ be a probability measure on $\mathcal{Z}\times\mathcal{Y}$ such that the marginal distribution $P_{\mathcal{Y}}$ has a finite first moment, i.e., $\int |y|\, P(d(z,y)) < \infty$. Let $L$ be the absolute deviation loss (21) and assume the heteroscedastic model given



by [2ty-(32). Define a := a h c s > and t* := (AfA,p)(z) for every z G Z. 
Then, there is a B G 53 £ such that Pz{B) = 1 and, for every z G B and 
t G (t* — a,t* z + a), 



L(z,y,t)P(dy\z) 



L(z,y,f z )P(dy\z). 



6.2 Proofs 



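Proof of Theorem 2.3: Fix any $D_n = ((z_1,y_1), \dots, (z_n,y_n)) \in (\mathcal{Z}\times\mathcal{Y})^n$ and $\lambda \in (0,\infty)$. Existence and uniqueness of $f_{A,D_n,\lambda}$ defined by (13) follow from Theorem 2.4 by choosing the empirical measure for $P$. Define

$$\tilde{A} : H \to \mathbb{R}^n, \qquad f \mapsto \bigl( (Af)(z_1), \dots, (Af)(z_n) \bigr)^T.$$

The assumptions on $A$ imply that $\tilde{A}$ is again linear and continuous. Let $\tilde{A}^*$ denote the adjoint operator of $\tilde{A}$. Then, for every $x \in \mathcal{X}$,

$$(\tilde{A}^*(e_i))(x) = \langle \tilde{A}^*(e_i), \Phi(x) \rangle_H = \langle e_i, \tilde{A}(\Phi(x)) \rangle_{\mathbb{R}^n} = \bigl(A(\Phi(x))\bigr)(z_i) \tag{57}$$

and, therefore,

$$\bigl( A\bigl( (A\Phi(\cdot))(z_i) \bigr) \bigr)(z_j) \overset{(57)}{=} \bigl( A(\tilde{A}^*(e_i)) \bigr)(z_j) = \langle \tilde{A}^*(e_i), \tilde{A}^*(e_j) \rangle_H. \tag{58}$$

This implies that the matrix $M$ is symmetric and, in addition, that it is positive semi-definite because, for every $\alpha = (\alpha_1, \dots, \alpha_n)^T \in \mathbb{R}^n$,

$$\alpha^T M \alpha \overset{(58)}{=} \sum_{i,j} \alpha_i \alpha_j \langle \tilde{A}^*(e_i), \tilde{A}^*(e_j) \rangle_H = \Bigl\| \sum_{i=1}^{n} \alpha_i \tilde{A}^*(e_i) \Bigr\|_H^2 \ge 0. \tag{59}$$

Fix any $\alpha = (\alpha_1, \dots, \alpha_n)^T \in \mathbb{R}^n$ and define

$$f_\alpha(x) = \sum_{i=1}^{n} \alpha_i\, (\tilde{A}^*(e_i))(x) \qquad \forall x \in \mathcal{X}.$$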



Note that (57) implies that fo*EH and 

J57l 



ll/f" 2 



67' , 



(.-,!)) 



a T Ma 



(60) 



(61) 



Furthermore, 
(4/b)(%) 



(58) 



<ej,i(/o) 

n 



| |60|57t 



8=1 



'.J 



a T Mej . 



(62) 



Hence, (16) follows from the definition of the regularized risk $\mathcal{R}_{A,D_n,\lambda}$, (61), and (62).
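The finite-dimensional objective obtained from (61) and (62) can be written down directly in terms of $\alpha$ and $M$. The sketch below assumes the absolute deviation loss $L(z,y,t) = |y - t|$ used in the simulation and an empirical risk normalized by $1/n$; it only evaluates the objective and does not minimize it.

```python
def regularized_risk(alpha, M, y, lam):
    """Evaluate (1/n) * sum_j |y_j - alpha^T M e_j| + lam * alpha^T M alpha.

    alpha^T M e_j plays the role of (A f_alpha)(z_j), see (62), and
    alpha^T M alpha the role of the squared H-norm of f_alpha, see (61).
    """
    n = len(y)
    fitted = [sum(alpha[i] * M[i][j] for i in range(n)) for j in range(n)]
    loss = sum(abs(yj - fj) for yj, fj in zip(y, fitted)) / n
    penalty = lam * sum(
        alpha[i] * M[i][j] * alpha[j] for i in range(n) for j in range(n)
    )
    return loss + penalty
```

Since the objective depends on the data only through $M$ and $y$, software for ordinary regularized kernel methods can be reused by handing it $M$ in place of a kernel matrix.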



It only remains to prove (14), which can be done similarly to the proof of (Krebs et al., 2009, Theorem 3.2) and (Krebs, 2011, Lemma 3.1). The main idea of the proof is to show that $f_{A,D_n,\lambda}$ is an element of the image $\mathrm{im}(\tilde{A}^*)$. First, note that the image $\mathrm{im}(\tilde{A}^*)$ is a finite-dimensional linear subspace of $H$, hence, it is closed; see, e.g., (Denkowski et al., 2003, Cor. 3.2.17). Then, for every $f_0 \in H$, there is an $f_1 \in \mathrm{im}(\tilde{A}^*)$ and an $f_2 \in (\mathrm{im}(\tilde{A}^*))^{\perp}$ such that $f_0 = f_1 + f_2$; see, e.g., (Denkowski et al., 2003, Cor. 3.7.16). Then,

$$(Af_0)(z_i) = (Af_1)(z_i) + (Af_2)(z_i) = (Af_1)(z_i) + \langle e_i, \tilde{A}(f_2) \rangle_{\mathbb{R}^n} = (Af_1)(z_i) + \langle \tilde{A}^*(e_i), f_2 \rangle_H = (Af_1)(z_i)$$

and

$$\|f_0\|_H^2 = \|f_1\|_H^2 + \|f_2\|_H^2.$$

This shows that, if $f_0$ is not in the image of $\tilde{A}^*$, then there is another $f_1 \in H$ such that $\mathcal{R}_{A,D_n,\lambda}(f_1) < \mathcal{R}_{A,D_n,\lambda}(f_0)$. Hence, the minimizer $f_{A,D_n,\lambda}$ is in the image of $\tilde{A}^*$, that is, there are $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ such that

$$f_{A,D_n,\lambda} = \tilde{A}^*\Bigl( \sum_{i=1}^{n} \alpha_i e_i \Bigr) = \sum_{i=1}^{n} \alpha_i \tilde{A}^*(e_i) \overset{(57)}{=} \sum_{i=1}^{n} \alpha_i\, (A\Phi(\cdot))(z_i).$$

□
Proof of Theorem 2.4: Consider $\mathcal{R}_{A,P,\lambda}$ as a map from $H$ to $\mathbb{R} \cup \{\pm\infty\}$. The map $\mathcal{R}_{A,P,\lambda}$ is convex and fulfills $\mathcal{R}_{A,P,\lambda}(f) > -\infty$ for every $f \in H$ and $\lim_{\|f\|_H \to \infty} \mathcal{R}_{A,P,\lambda}(f) = \infty$ because

$$\liminf_{\|f\|_H \to \infty} \mathcal{R}_{A,P,\lambda}(f) \ge \lambda \liminf_{\|f\|_H \to \infty} \|f\|_H^2 = \infty.$$

According to Assumption 2.1,

$$f_n \rightharpoonup f \ \text{in } H \quad\Longrightarrow\quad (Af_n)(z) \to (Af)(z) \quad \forall z \in \mathcal{Z}$$

so that Fatou's Lemma (e.g. Denkowski et al., 2003, Theorem 2.2.17) implies that $\mathcal{R}_{A,P,\lambda}$ is lower semicontinuous. Then, it follows from (Denkowski et al., 2003, Prop. 5.2.12) that there is an $f_{A,P,\lambda} \in H$ which minimizes $\mathcal{R}_{A,P,\lambda}$ in $H$. Assumption (19) implies $\mathcal{R}_{A,P}(f_{A,P,\lambda}) < \infty$ so that uniqueness of $f_{A,P,\lambda} \in H$ follows from strict convexity of the squared norm and convexity of $\mathcal{R}_{A,P}$. □



Proof of Prop. 6.1: See, e.g., (Denkowski et al., 2003, Prop. 3.7.47) for the fact that compactness of $A : H \to C_b(\mathcal{Z})$ is equivalent to (46). In order to show (46), fix any sequence $(f_n)_{n\in\mathbb{N}} \subset H$ which converges weakly in $H$ to some $f_0 \in H$ for $n \to \infty$. As a weakly convergent sequence in a Hilbert space is bounded (see, e.g., Denkowski et al., 2003, Cor. 3.4.10), there is a $c \in (0,\infty)$ such that, for every $n \in \mathbb{N}$, we have $\|f_n\|_H \le c$. Hence, for every sequence $x_\ell \to x_0$ in $\mathcal{X}$,

$$\lim_{\ell\to\infty} \sup_{n\in\mathbb{N}} |f_n(x_\ell) - f_n(x_0)| = \lim_{\ell\to\infty} \sup_{n\in\mathbb{N}} |\langle f_n, \Phi(x_\ell) - \Phi(x_0) \rangle_H| \le \lim_{\ell\to\infty} \sup_{n\in\mathbb{N}} \|f_n\|_H \cdot \|\Phi(x_\ell) - \Phi(x_0)\|_H \le c \cdot \lim_{\ell\to\infty} \|\Phi(x_\ell) - \Phi(x_0)\|_H = 0$$

since continuity of $k$ implies continuity of $\Phi$; see, e.g., (Steinwart and Christmann, 2008, Lemma 4.29). That is, we have shown that the sequence $(f_n)_{n\in\mathbb{N}}$ is equicontinuous. In addition, it follows from weak convergence in $H$ and the reproducing property (10) that $f_n$ converges to $f_0$ pointwise. Since $\mathcal{X}$ is compact, pointwise convergence together with equicontinuity implies uniform convergence of $f_n$ to $f_0$; see, e.g., (Denkowski et al., 2003, Prop. 1.6.14 and Theorem 1.6.12). Hence, the statement follows from the assumption on $A$. □

Proof of Theorem 6.2: We start with the proof of (a) and (b). Define

$$Q_P : L_2(P) \to \mathbb{R}, \qquad g \mapsto \int L(z, y, g(z,y))\, P(d(z,y)).$$

That is, the risk $\mathcal{R}_{A,P} : H \to \mathbb{R}$ is given by $\mathcal{R}_{A,P}(f) = (Q_P \circ A_P)(f)$, $f \in H$. Note that Assumption 2.2 implies that $Q_P$ is well defined. Then,
it follows from (Rockafellar, 1976, Prop. 2C and Cor. 3E) that the subdifferential of the convex map $Q_P$ is given by

$$\partial Q_P(g) = \bigl\{ h \in L_2(P) \bigm| h(z,y) \in \partial L(z,y,g(z,y)) \ \ P(d(z,y))\text{-a.s.} \bigr\};$$

see, e.g., (Steinwart and Christmann, 2008, Prop. A.6.13). It is easy to see from Assumption 2.2 that $Q_P$ is continuous on $L_2(P)$ and, therefore,

$$\partial \mathcal{R}_{A,P}(f) = \partial (Q_P \circ A_P)(f) = A_P^*\bigl( \partial Q_P(A_P f) \bigr);$$

see, e.g., (Denkowski et al., 2003, Theorem 5.3.33). That is,

$$\partial \mathcal{R}_{A,P}(f) = \bigl\{ A_P^*(h) \bigm| h \in L_2(P),\ h(z,y) \in \partial L(z,y,(A_P f)(z,y)) \ \ P(d(z,y))\text{-a.s.} \bigr\}.$$



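The convex map $H \to \mathbb{R}$, $f \mapsto \lambda \|f\|_H^2$ is Fréchet differentiable with derivative $2\lambda f$ (see, e.g., Denkowski et al., 2003, Example 5.1.6(c)) and, therefore, its subdifferential at $f \in H$ is given by $\{2\lambda f\}$ (see, e.g., Denkowski et al., 2003, Prop. 5.3.30). Since $\mathcal{R}_{A,P,\lambda}(f) = \mathcal{R}_{A,P}(f) + \lambda \|f\|_H^2$ for every $f \in H$, it follows that

$$\partial \mathcal{R}_{A,P,\lambda}(f) = 2\lambda f + \partial \mathcal{R}_{A,P}(f), \qquad f \in H;$$

see, e.g., (Denkowski et al., 2003, Theorem 5.3.32). Since $\mathcal{R}_{A,P,\lambda}$ attains its minimum in $H$ at $f_{A,P,\lambda}$, it follows from the definition of the subdifferential that $0 \in \partial \mathcal{R}_{A,P,\lambda}(f_{A,P,\lambda})$. That is, there is an $h \in L_2(P)$ such that $h(z,y) \in \partial L(z,y,(A_P f_{A,P,\lambda})(z,y))$ for $P$-a.e. $(z,y) \in \mathcal{Z}\times\mathcal{Y}$ and $f_{A,P,\lambda} = -\frac{1}{2\lambda} A_P^*(h)$. In the following, it is shown that we can even choose $h_{P,\lambda} \in L_2(P)$ such that $h_{P,\lambda}(z,y) \in \partial L(z,y,(A_P f_{A,P,\lambda})(z,y))$ for every $(z,y) \in \mathcal{Z}\times\mathcal{Y}$. For every $(z,y) \in \mathcal{Z}\times\mathcal{Y}$, let $L'_+(z,y,\cdot)$ denote the right derivative function of $L(z,y,\cdot)$. Recall from (Rockafellar, 1970, Theorem 24.1 and p. 229) that this is a function $L'_+(z,y,\cdot) : \mathbb{R} \to \mathbb{R}$ and $L'_+(z,y,t) \in \partial L(z,y,t)$ for every $(z,y) \in \mathcal{Z}\times\mathcal{Y}$ and $t \in \mathbb{R}$. Since the function $L'_+ : (z,y,t) \mapsto L'_+(z,y,t)$ is the pointwise limit of a sequence of measurable functions, $L'_+$ is measurable. Hence, there is a $P$-null-set $N \in \mathfrak{B}_{\mathcal{Z}\times\mathcal{Y}}$ such that $h_{P,\lambda} \in L_2(P)$ defined by

$$h_{P,\lambda}(z,y) = h(z,y)\, I_{N^c}(z,y) + L'_+\bigl(z,y,(A_P f_{A,P,\lambda})(z,y)\bigr)\, I_N(z,y)$$

fulfills (48) and (49). Next, (50) follows from

$$f_{A,P,\lambda}(x) = \langle f_{A,P,\lambda}, \Phi(x) \rangle_H = -\frac{1}{2\lambda} \langle A_P^* h_{P,\lambda}, \Phi(x) \rangle_H = -\frac{1}{2\lambda} \langle h_{P,\lambda}, A_P(\Phi(x)) \rangle_{L_2(P)} = -\frac{1}{2\lambda} \int A(\Phi(x))\, h_{P,\lambda}\, dP$$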
for every $x \in \mathcal{X}$. Finally, in order to prove (47), note that

$$\lambda \|f_{A,P,\lambda}\|_H^2 \le \mathcal{R}_{A,P,\lambda}(f_{A,P,\lambda}) \le \mathcal{R}_{A,P,\lambda}(0) \le \int b\, dP$$

and, therefore, (7) implies

$$\|A f_{A,P,\lambda}\|_\infty \le \|A\| \cdot \|f_{A,P,\lambda}\|_H < \|A\| \sqrt{\tfrac{1}{\lambda} \textstyle\int b\, dP} + 1 =: a < \infty. \tag{63}$$

Now, fix any $(z,y) \in \mathcal{Z}\times\mathcal{Y}$, $t \in (-a, a)$, and $\gamma_t \in \partial L(z,y,t)$. Then, the definition of the subdifferential implies

$$\gamma_t \cdot (a - t) \le L(z,y,a) - L(z,y,t) \le \bigl(b_0'(y) + b_1' a^p\bigr) \cdot |a - t|$$

and

$$\gamma_t \cdot (-a - t) \le L(z,y,-a) - L(z,y,t) \le \bigl(b_0'(y) + b_1' a^p\bigr) \cdot |-a - t|.$$

Dividing these inequalities by $|a - t| = a - t$ and $-|-a - t| = -a - t$, respectively, leads to

$$|\gamma_t| \le b_0'(y) + b_1' a^p. \tag{64}$$

Due to (63), we may choose $t = (A f_{A,P,\lambda})(z) \in (-a, a)$ so that (48) and (64) yield

$$|h_{P,\lambda}(z,y)| \le b_0'(y) + b_1' a^p.$$

Then, (47) follows from the definition of $a$ in (63); that is, we have proven parts (a) and (b) of the theorem.

For the proof of part (c), let $P_1$ be any probability measure on $\mathcal{Z}\times\mathcal{Y}$ such that $\int b\, dP_1 < \infty$ and $\int b_0'^{\,2}\, dP_1 < \infty$. In order to shorten the notation, define $h_0 := h_{P,\lambda}$, $f_0 := f_{A,P,\lambda}$, and $f_1 := f_{A,P_1,\lambda}$. Then, (48) implies

$$h_0(z,y)\bigl(Af_1(z) - Af_0(z)\bigr) \le L(z,y,Af_1(z)) - L(z,y,Af_0(z)) \tag{65}$$

for every $(z,y) \in \mathcal{Z}\times\mathcal{Y}$. The map $A_{P_1} : H \to L_2(P_1)$ defined by $(A_{P_1} f)(z,y) = (Af)(z)$ is a continuous linear operator; let $A_{P_1}^*$ denote its adjoint operator. Since $h_0 \in L_2(P_1)$ according to (47) and the assumptions on $P_1$, it follows that

$$\int h_0(z,y)\bigl(Af_1(z) - Af_0(z)\bigr)\, P_1(d(z,y)) = \langle h_0, A_{P_1}(f_1 - f_0) \rangle_{L_2(P_1)} = \langle f_1 - f_0, A_{P_1}^* h_0 \rangle_H.$$

Hence, integrating both sides of (65) with respect to $P_1$ implies

$$\langle f_1 - f_0, A_{P_1}^* h_0 \rangle_H \le \mathcal{R}_{A,P_1}(f_1) - \mathcal{R}_{A,P_1}(f_0). \tag{66}$$
An elementary calculation shows

$$2\lambda \langle f_1 - f_0, f_0 \rangle_H + \lambda \|f_0 - f_1\|_H^2 = \lambda \|f_1\|_H^2 - \lambda \|f_0\|_H^2. \tag{67}$$
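For completeness, the identity (67) can be checked by expanding the squared norm; the following short derivation uses only bilinearity and symmetry of the inner product in $H$.

```latex
\begin{align*}
2\lambda \langle f_1 - f_0, f_0 \rangle_H + \lambda \|f_0 - f_1\|_H^2
  &= 2\lambda \langle f_1, f_0 \rangle_H - 2\lambda \|f_0\|_H^2
     + \lambda \|f_0\|_H^2 - 2\lambda \langle f_0, f_1 \rangle_H + \lambda \|f_1\|_H^2 \\
  &= \lambda \|f_1\|_H^2 - \lambda \|f_0\|_H^2 .
\end{align*}
```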



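Then, calculating (66) + (67), the definition of the regularized risk $\mathcal{R}_{A,P_1,\lambda}$, and the definition of $f_1 = f_{A,P_1,\lambda}$ imply

$$\langle f_1 - f_0, A_{P_1}^* h_0 + 2\lambda f_0 \rangle_H + \lambda \|f_0 - f_1\|_H^2 \le \mathcal{R}_{A,P_1,\lambda}(f_1) - \mathcal{R}_{A,P_1,\lambda}(f_0) \le 0.$$

Hence, it follows from (49) that

$$\lambda \|f_0 - f_1\|_H^2 \le \langle f_0 - f_1, A_{P_1}^* h_0 - A_P^* h_0 \rangle_H \le \|f_0 - f_1\|_H\, \|A_{P_1}^* h_0 - A_P^* h_0\|_H$$

and, therefore,

$$\lambda \|f_1 - f_0\|_H \le \|A_{P_1}^* h_0 - A_P^* h_0\|_H = \sup_{f\in\mathcal{F}} \bigl| \langle f, A_{P_1}^* h_0 \rangle_H - \langle f, A_P^* h_0 \rangle_H \bigr| = \sup_{f\in\mathcal{F}} \bigl| \langle A_{P_1} f, h_0 \rangle_{L_2(P_1)} - \langle A_P f, h_0 \rangle_{L_2(P)} \bigr| = \sup_{f\in\mathcal{F}} \Bigl| \int h_0(z,y)\,(Af)(z)\, P_1(d(z,y)) - \int h_0(z,y)\,(Af)(z)\, P(d(z,y)) \Bigr|.$$

□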



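Proof of Theorem 6.3: For every $n \in \mathbb{N}$, take $h_n := h_{P,\lambda_n}$ from Theorem 6.2 and define

$$g_{n,f} := \frac{a_n}{\lambda_n \sqrt{n}}\, h_n\, Af \qquad \forall f \in \mathcal{F}$$

where $\mathcal{F} = \{ f \in H \mid \|f\|_H \le 1 \}$. Then,

$$\mathcal{G}_n := \{ g_{n,f} \mid f \in \mathcal{F} \}$$

is a changing class in the sense of (van der Vaart, 1998, § 19.5) and we can use a Donsker theorem for such classes in order to prove

$$\Bigl( \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \bigl( g_{n,f}(Z_i, Y_i) - E_P\, g_{n,f} \bigr) \Bigr)_{f\in\mathcal{F}} \rightsquigarrow 0 \quad\text{in } \ell_\infty(\mathcal{F}) \tag{68}$$

below. Then, it follows from (68), the definition of $g_{n,f}$, and the continuous mapping theorem (e.g., van der Vaart and Wellner, 1996, Theorem 1.3.6) that

$$\frac{a_n}{\lambda_n} \sup_{f\in\mathcal{F}} \Bigl| \frac{1}{n} \sum_{i=1}^{n} h_{P,\lambda_n}(Z_i, Y_i)\, Af(Z_i) - \int h_{P,\lambda_n}\, Af\, dP \Bigr| \rightsquigarrow 0. \tag{69}$$

Since weak convergence to a constant implies convergence in (outer) probability (see, e.g., van der Vaart and Wellner, 1996, Lemma 1.10.2), the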
statement of Theorem 6.3 follows from (69) and (51). Measurability of the random variable $\|f_{A,D_n,\lambda_n} - f_{A,P,\lambda_n}\|_H$ can be proven by simply following the lines of the proof of (Hable, 2013, Lemma 9a). (The only noteworthy difference is that proving the analogue of (Hable, 2013, (54)) involves an application of Prop. 6.1.)



Hence, it only remains to prove (68) and this is done by use of ( |van der 



Vaart 



1998 



19.5) in the following. To this end, note that Q n = {g n ,f \ f £ 
T\ is a class of measurable functions indexed by T . For the metric p de- 
fined by /9(/i,/ 2 ) = ||/i - /2II00, /i,/2 e F, the index set F is totally 



bounded because F is relatively compact in Cf,(X) according to (Steinwart 



and Christmann, 2008, Cor. 4.31). It follows from lim n _ s . 00 A n = and (47) 



that there is a c G (0, 00) such that, for every n G N, 

\h n (z,y)\ < b' (y) + c\-p/ 2 V(z,y)eZxy 



(70) 



Since - Af^ < - h\\ H < 2\\A\\ for every f h f 2 G F, it 

follows from the definitions that 

J {9n, h -9nj 2 ) 2 dP < 4||A|| 2 .(^=) jh 2 n dP V/i,/ 2 GJT. (71) 



Then, an easy calculation using (52), (70), and (71) shows 



lim sup / (g nJl - g nj2 ) 2 dP = 0. 



(72) 



Due to (52), there is a G G C2{Py) such that, for every n G N and (z,y) G 

z xy, 



''"" A|I M*.V)| T X 1 ^^) + < G(y). (73) 



K,\/n 



Since \\Af\\< 
that 



< \\A\ 



H < 1 1 ^4 1 1 for every / G J 7 , it follows from (73) 



I <7n,/(*> 2/) I < G (y) V(z,y)eZxy, n G N, / G J\ 



(74) 



Hence, G is an envelope function of Q n for every n G N which fulfills the 
Lindeberg conditions J G 2 dP < 00 and limn^oo J G 2 I G>e ^ dP = for 
every e > 0. 

Let 1 1 A 1 1 oo denote the operator norm of A as an operator from Cb(X) to 
C b (Z). Then, it follows from d73l) that 



\g n ,fi(z,y) -g n j 2 ( z ,y)\ < \\A\\ooG(y)-\\fi -f- 



2 00 



According to flvan der Vaart and Wellner 1996, § 2.7.4), this implies 



iV [] (2 £ ||A|| 0O ||G|| L2( p ) ,g n ,||-|| L2(P) ) < N(e,F,\ 



(75) 
where $N_{[\,]}$ denotes the bracketing number and $N(\cdot)$ denotes the covering number; see, e.g., (van der Vaart and Wellner, 1996, § 2.1.1). According to the assumptions on $\mathcal{X}$ and $k$, it follows from (van der Vaart and Wellner, 1996, Theorem 2.7.1) that there is some $c_0 \in (0,\infty)$ such that, for every $\varepsilon > 0$, we have $\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le c_0\, \varepsilon^{-d/m}$; see, e.g., (Hable, 2012b, Eqn. (61)). Hence, it follows from (75) that



logJVnfe,^ 



\L2{P), 



< Cq-(2|M|| 0O ||G| 



L2(P)) 



Recall that m > d/2. Hence, the bracketing integral fulfills 

r5n 



J[] (Snt Gn, II • llza(P)) = J \/ lo g N [\ ( e > Gn, 

for every sequence of real numbers 5 n \ 0. 



\L2(P), 



de 



Ve > 0. 



-> (76) 



Finally, the definition of g n j, (70), and (52) imply \im. n -+ ao g n j{z,y) = 
for every / G T and (z,y) G Z x y so that the dominated convergence 
theorem and (74) yield 

lim Elgnj^Zi, Yi)g ni f 2 (Zi, 3^)] - E^^, Y^E^^, r<) = (77) 



for all /i, / 2 G J". Summing up, because of (|72j), (|74J), Q76J), and (|77|), the 
assumptions in (|van der Vaart 1998[ § 19.5) are fulfilled and it follows from 
(|van der Vaart] 1998| Theorem 19.28) that 



G P 



m 



where Gp is a tight Gaussian process. That is, it only remains to prove 
= 0. This follows from considering finite marginals: Fix any ft, . . ■ , f a G 



J 7 and note that, due to (74) and (77), the (multivariate) Lindeberg- Feller 
central limit theorem (e.g., van der Vaart, 1998, Prop. 2.27) for the random 

i€{l, 



vectors 



(gnj^Z^Yi), . . . ,g nJe (Zi,Yi)) 
implies that (Gp(/i), . . . , G P (f s )) = 0. 



,n}, 



□ 



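Proof of Prop. 6.4: According to Assumption (19), the set $H_0 := \{ f \in H \mid \mathcal{R}_{A,P}(f) < \infty \}$ is non-empty. Since $B_f : [0,\infty) \to \mathbb{R}$, $\lambda \mapsto \mathcal{R}_{A,P,\lambda}(f)$ is continuous for every $f \in H_0$, the map $B : [0,\infty) \to \mathbb{R}$ defined by $B(\lambda) = \inf_{f\in H_0} B_f(\lambda)$, $\lambda \in [0,\infty)$, is upper semi-continuous. For every $n \in \mathbb{N}$, the function $f_{A,P,\lambda_n} \in H$ uniquely exists according to Theorem 2.4 and we have $B(\lambda_n) = \mathcal{R}_{A,P,\lambda_n}(f_{A,P,\lambda_n})$. Then, the statement follows from

$$\inf_{f\in H} \mathcal{R}_{A,P}(f) \le \liminf_{n\to\infty} \mathcal{R}_{A,P}(f_{A,P,\lambda_n}) \le \limsup_{n\to\infty} \mathcal{R}_{A,P}(f_{A,P,\lambda_n}) \le \limsup_{n\to\infty} B(\lambda_n) \le B(0) = \inf_{f\in H} \mathcal{R}_{A,P}(f).$$

□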
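Proof of Theorem 6.5: First of all, it is shown that

$$(f_n)_{n\in\mathbb{N}} \subset H,\quad f_n \rightharpoonup f_0 \quad\Longrightarrow\quad \lim_{n\to\infty} \mathcal{R}_{A,P}(f_n) = \mathcal{R}_{A,P}(f_0). \tag{78}$$

According to Prop. 6.1, it follows from weak convergence of the sequence $(f_n)_{n\in\mathbb{N}} \subset H$ that $\lim_{n\to\infty} \|A(f_n) - A(f_0)\|_\infty = 0$. Hence, there is an $a \in (0,\infty)$ such that, for every $n \in \mathbb{N}_0$, we have $\|A(f_n)\|_\infty \le a$. Then, (78) follows from

$$\bigl| \mathcal{R}_{A,P}(f_n) - \mathcal{R}_{A,P}(f_0) \bigr| \le \int \bigl(b_0' + b_1' a^p\bigr) \cdot |A(f_n) - A(f_0)|\, dP \le \|A(f_n) - A(f_0)\|_\infty \cdot \int \bigl(b_0' + b_1' a^p\bigr)\, dP.$$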


Next, define

$$H_0 := \bigl\{ f^* \in H \bigm| \mathcal{R}_{A,P}(f^*) = \inf_{f\in H} \mathcal{R}_{A,P}(f) \bigr\}.$$

It follows from (78) that $H_0$ is closed in $H$. Furthermore, it follows from convexity of $t \mapsto L(z,y,t)$ that $H_0$ is a convex set. Hence, $H_0$ has a unique element $f_{A,P}$ of smallest norm; see, e.g., (Denkowski et al., 2003, Theorem 3.7.9). That is, we have shown unique existence of an element $f_{A,P} \in H$ which fulfills (54) and (55).

The following proof of (56) is similar to the proof of (Steinwart and Christmann, 2008, Theorem 5.17). Let us fix a sequence $(\lambda_n)_{n\in\mathbb{N}} \subset (0,\infty)$ such that $\lim_{n\to\infty} \lambda_n = 0$ and, first, prove

$$\|f_{A,P,\lambda_n}\|_H \le \|f_{A,P}\|_H \qquad \forall n \in \mathbb{N} \tag{79}$$

by contradiction. If (79) was not true, then $\|f_{A,P,\lambda_n}\|_H > \|f_{A,P}\|_H$ for some $n \in \mathbb{N}$ and it would follow from $f_{A,P} \in H_0$ that $\mathcal{R}_{A,P,\lambda_n}(f_{A,P,\lambda_n}) > \mathcal{R}_{A,P,\lambda_n}(f_{A,P})$, which is a contradiction to the definition of $f_{A,P,\lambda_n}$. That is, we have shown (79). Now, we prove (56) by contradiction. If $f_{A,P,\lambda_n}$ does not converge to $f_{A,P}$, then there is an $\varepsilon > 0$ and a subsequence $(f_{A,P,\lambda_{n_\ell}})_{\ell\in\mathbb{N}}$ such that

$$\|f_{A,P,\lambda_{n_\ell}} - f_{A,P}\|_H > \varepsilon. \tag{80}$$



Due to (79), the subsequence $(f_{A,P,\lambda_{n_\ell}})_{\ell\in\mathbb{N}}$ is bounded and, therefore, there is a further subsequence which converges weakly; see, e.g., (Dunford and Schwartz, 1958, Cor. IV.4.7). That is, we may assume without loss of generality that $(f_{A,P,\lambda_{n_\ell}})_{\ell\in\mathbb{N}}$ fulfills (80) and converges weakly to some $f_0 \in H$. Then, it follows from (78) that $\lim_{\ell\to\infty} \mathcal{R}_{A,P}(f_{A,P,\lambda_{n_\ell}}) = \mathcal{R}_{A,P}(f_0)$. Hence, Prop. 6.4 implies that $\mathcal{R}_{A,P}(f_0) = \inf_{f\in H} \mathcal{R}_{A,P}(f)$ and, due to (55), we either have $\|f_0\|_H > \|f_{A,P}\|_H$ or $f_0 = f_{A,P}$. However, weak convergence implies

$$\|f_0\|_H \le \liminf_{\ell\to\infty} \|f_{A,P,\lambda_{n_\ell}}\|_H \le \limsup_{\ell\to\infty} \|f_{A,P,\lambda_{n_\ell}}\|_H \overset{(79)}{\le} \|f_{A,P}\|_H.$$

This proves $f_0 = f_{A,P}$ and $\lim_{\ell\to\infty} \|f_{A,P,\lambda_{n_\ell}}\|_H = \|f_{A,P}\|_H$. Finally, this latter convergence of the norms together with weak convergence of $f_{A,P,\lambda_{n_\ell}}$ to $f_0 = f_{A,P}$ implies $\lim_{\ell\to\infty} \|f_{A,P,\lambda_{n_\ell}} - f_{A,P}\|_H = 0$; see, e.g., (Conway, 1985, Exercise V.1.8). This is a contradiction to (80).



Proof of Theorem I3.lt Theorem 3.1 immediately follows from Theorem 
lOandEB] □ 



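Proof of Theorem 3.2: Theorem 3.1 guarantees existence of $f_{A,P} \in H$ which fulfills (21) and (22). As in the case of ordinary regularized kernel methods (i.e. $A = \mathrm{id}$),

$$0 \le \mathcal{R}_{A,P}(f_{A,P,\lambda_n}) - \mathcal{R}^*_{A,P} \le \mathcal{R}_{A,P,\lambda_n}(f_{A,P,\lambda_n}) - \mathcal{R}^*_{A,P} \le \mathcal{R}_{A,P,\lambda_n}(f_{A,P}) - \mathcal{R}^*_{A,P} \le \lambda_n \|f_{A,P}\|_H^2;$$

see (Steinwart and Christmann, 2008, p. 182). Hence,

$$\mathcal{R}_{A,P}(f_{A,P,\lambda_n}) - \mathcal{R}^*_{A,P} \le \lambda_n \|f_{A,P}\|_H^2. \tag{81}$$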
Fix no 6 N such that a n > 1 for every n > uq. For every e G (0, 1), define 
:= {D n £(2xJ) n an||/A,D„,A n -/A.P.AnIL <^} VneN 



and a := sup ngN \\fA,p,x n ||oo + 1- It follows from Theorem 6.5 and (|9j) that 
< a < oo. Then, Assumption 2.2 implies that, for every D n G B £n and 
n > no, 

a n T^A,p{fA,D n ,x n ) -T^A,p{fA,p,x n ) < 

< J b 'o + b'xoP dP ■ a n \\f A ,D n ,x n - /apaJL 

Since lin^^oo P n ( y B £ ^ = 1 for every e G (0, 1) according to Theorem 
and Q, it follows that 

Urn p({a,G (Zx3>) n | a n |^A,p(/AD, l ,A„)-^A,p(/AP,A„)| < e}) = 1 



for every e > 0. Then, (26) follows from ( |8l[ ) an d (|25[ ). 

Finally, an elementary calculation shows that (25) is fulfilled for $\lambda_n = \gamma\, n^{-1/(4+p)}$ $\forall n \in \mathbb{N}$ for a constant $\gamma \in (0,\infty)$. □
Proof of Lemma 6.6: According to the model assumptions, there is a set $B \in \mathfrak{B}_{\mathcal{Z}}$ such that $P_{\mathcal{Z}}(B) = 1$ and, for every $z \in B$, the conditional distribution $P(\cdot\,|\,z)$ is equal to the distribution of $t_z^* + s(z)\varepsilon$ where $t_z^* = (A f_{A,P})(z)$ is the (unique) median of the conditional distribution $P(\cdot\,|\,z)$. Now fix any $z \in B$. Since $t_z^*$ is the median,



P((-oo,t* z ]\z) > I > P((i*,oo)|z) 
P[[t* z ,oo)\z) > \ > P{(-oc,t* z )\z) 



and the model assumptions ([28])— ( 32 ) imply 



P{[t*,t* z + 6) 
P((t* z -5,t* z ) 



> 
> 



Ch 



■ 6 



■ 5 



V<5 G (0,a) 
V5G (0,a) 



(82) 
(83) 



(84) 
(85) 



The rest of the proof is divided into different cases; first, we consider the 
case that t > t*. Note that 



\y-t\-\y-t*\ > 
\y-t\-\y-Q > t* z -t 



if t* z < y < \{t* + t) 
if \{t* z + t) < y < t 



(86) 
(87) 



Then, dividing the domain of integration by t*, \{t* z + t), and t into four 
parts yields 

L(z,y,t)P(dy\z) - J L(z,y,t* z ) P(dy\z) = f \y - t\ - \y - f z \ P(dy\z) 



(t-f z ).P^-oo,f z 
-(t-t* z ).p((l(t:+t),t) 

(t-t* z )-p((t* z , l(t* z + t))\z) 



z) +0-P (t 



j.* 1 



(f z + 1)) 



(t-t* z )-P([t,oo 



Ch 

2c, 



*\2 



(t - Q 



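Next, we consider the case that $t < t_z^*$. As before, dividing the
domain of integration by $t$, $\tfrac{1}{2}(t_z^* + t)$, and $t_z^*$ into four parts yields

$$\begin{aligned}
\int L(z,y,t)\,P(dy \mid z) - \int L(z,y,t_z^*)\,P(dy \mid z)
&= \int |y - t| - |y - t_z^*| \; P(dy \mid z) \\
&\ge -(t_z^* - t) \cdot P\bigl((-\infty, t] \mid z\bigr) - (t_z^* - t) \cdot P\bigl((t, \tfrac{1}{2}(t_z^* + t)] \mid z\bigr) \\
&\qquad + 0 \cdot P\bigl((\tfrac{1}{2}(t_z^* + t), t_z^*) \mid z\bigr) + (t_z^* - t) \cdot P\bigl([t_z^*, \infty) \mid z\bigr) \\
&\ge (t_z^* - t) \cdot P\bigl((\tfrac{1}{2}(t_z^* + t), t_z^*) \mid z\bigr) \\
&\ge \frac{c_h}{2\overline{c}_s}\,(t_z^* - t)^2 .
\end{aligned}$$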



Finally, the remaining case $t = t_z^*$ is trivial. □
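The quadratic lower bound obtained in the two cases above can be checked by hand in a simple special case. The following sketch is an illustration only and not part of the original argument: it assumes a scale function $s \equiv 1$, a median $t_z^* = 0$, and an error that is uniformly distributed on $[-1,1]$, so that the conditional density equals $1/2$ on $(-1,1)$ and the constant in front of the square becomes $\tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}$. All function names are hypothetical.

```python
def expected_abs_loss(t: float) -> float:
    """E|Y - t| for Y uniform on [-1, 1]; closed form, valid for |t| <= 1."""
    if abs(t) > 1.0:
        raise ValueError("closed form only holds for |t| <= 1")
    # Integral of |y - t| / 2 over [-1, 1] equals ((1 + t)^2 + (1 - t)^2) / 4.
    return ((1.0 + t) ** 2 + (1.0 - t) ** 2) / 4.0


def excess_risk(t: float, median: float = 0.0) -> float:
    """Conditional risk at t minus conditional risk at the median."""
    return expected_abs_loss(t) - expected_abs_loss(median)


def quadratic_lower_bound(t: float, density_floor: float = 0.5,
                          median: float = 0.0) -> float:
    """(density_floor / 2) * (t - median)^2, the shape of bound derived above."""
    return 0.5 * density_floor * (t - median) ** 2


if __name__ == "__main__":
    for t in (-0.9, -0.5, -0.1, 0.0, 0.1, 0.5, 0.9):
        print(t, excess_risk(t), quadratic_lower_bound(t))
```

In this special case the excess risk equals $t^2/2$, so the bound $t^2/4$ holds with room to spare; the slack comes from discarding the mass beyond the midpoint $\tfrac{1}{2}(t_z^* + t)$.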
Proof of Theorem 3.3. First, note that $P$ and $L$ fulfill Assumption 2.2
for $p = 0$ and that the model assumptions (28)-(32) imply $\mathcal{R}_{A,P}(f_{A,P}) =
\inf_{f \in H} \mathcal{R}_{A,P}(f)$.

According to (7) and Theorem 6.5 there is an $n_0 \in \mathbb{N}$ such that
$\|A f_{A,P,\lambda_n} - A f_{A,P}\|_\infty \le \|A\| \cdot \|f_{A,P,\lambda_n} - f_{A,P}\|_H \le a_h \underline{c}_s =: a$ for every $n \ge n_0$. Hence,
it follows from Lemma 6.6 and (81) in the proof of Theorem 3.2 that there
is a constant $c \in (0,\infty)$ such that

$$\|A f_{A,P,\lambda_n} - A f_{A,P}\|_{L_2(P_Z)} \;\le\; c\,\lambda_n^{1/2} \qquad \forall n \ge n_0 . \tag{88}$$



According to Assumption (39),

$$g_0 := \sum_{j=1}^{\infty} \frac{\langle f_{A,P}, v_j \rangle_H}{\sigma_j} \cdot u_j \;\in\; L_2(P_{Z_0})$$



and it follows from (35) and the fact that $\{v_j \mid j \in \mathbb{N}\}$ is a complete
orthonormal system of $H$ that

$$f_{A,P} = A_0^* g_0 . \tag{89}$$



An easy calculation shows

$$\|f_{A,P,\lambda_n} - f_{A,P}\|_H^2 - \|f_{A,P,\lambda_n}\|_H^2 = 2\langle f_{A,P} - f_{A,P,\lambda_n}, f_{A,P} \rangle_H - \|f_{A,P}\|_H^2 . \tag{90}$$
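The calculation behind (90) is a direct expansion of the squared norm. Writing $f_n := f_{A,P,\lambda_n}$ and $f := f_{A,P}$ as shorthand (an abbreviation introduced only for this remark), one way to spell it out is:

```latex
\begin{align*}
\|f_n - f\|_H^2 - \|f_n\|_H^2
  &= \|f_n\|_H^2 - 2\langle f_n, f\rangle_H + \|f\|_H^2 - \|f_n\|_H^2 \\
  &= -2\langle f_n, f\rangle_H + \|f\|_H^2 \\
  &= 2\langle f, f\rangle_H - 2\langle f_n, f\rangle_H - \|f\|_H^2 \\
  &= 2\langle f - f_n, f\rangle_H - \|f\|_H^2 .
\end{align*}
```

The third line only rewrites $\|f\|_H^2$ as $2\langle f, f\rangle_H - \|f\|_H^2$, which is what produces the inner product with $f - f_n$.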



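According to the definitions, we have

$$\mathcal{R}_{A,P}(f_{A,P,\lambda_n}) - \mathcal{R}_{A,P}(f_{A,P}) \;\ge\; 0 \qquad \forall n \in \mathbb{N} \tag{91}$$
$$\mathcal{R}_{A,P,\lambda_n}(f_{A,P,\lambda_n}) - \mathcal{R}_{A,P,\lambda_n}(f_{A,P}) \;\le\; 0 \qquad \forall n \in \mathbb{N} . \tag{92}$$

Recall that $\mathcal{R}_{A,P,\lambda}(f) = \mathcal{R}_{A,P}(f) + \lambda \|f\|_H^2$. Hence,

$$\begin{aligned}
\lambda_n \|f_{A,P,\lambda_n} - f_{A,P}\|_H^2
&\overset{(91)}{\le} \mathcal{R}_{A,P}(f_{A,P,\lambda_n}) - \mathcal{R}_{A,P}(f_{A,P}) + \lambda_n \|f_{A,P,\lambda_n}\|_H^2 \\
&\qquad + \lambda_n \|f_{A,P,\lambda_n} - f_{A,P}\|_H^2 - \lambda_n \|f_{A,P,\lambda_n}\|_H^2 \\
&\overset{(90)}{=} \mathcal{R}_{A,P,\lambda_n}(f_{A,P,\lambda_n}) - \mathcal{R}_{A,P,\lambda_n}(f_{A,P}) + 2\lambda_n \langle f_{A,P} - f_{A,P,\lambda_n}, f_{A,P} \rangle_H \\
&\overset{(92)}{\le} 2\lambda_n \langle f_{A,P} - f_{A,P,\lambda_n}, f_{A,P} \rangle_H \\
&\overset{(89)}{=} 2\lambda_n \langle f_{A,P} - f_{A,P,\lambda_n}, A_0^* g_0 \rangle_H
 = 2\lambda_n \langle A_0 f_{A,P} - A_0 f_{A,P,\lambda_n}, g_0 \rangle_{L_2(P_{Z_0})} \\
&\le 2\lambda_n \|A_0 f_{A,P,\lambda_n} - A_0 f_{A,P}\|_{L_2(P_{Z_0})} \cdot \|g_0\|_{L_2(P_{Z_0})} .
\end{aligned}$$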



Together with (88), this implies that there is a constant $c \in (0,\infty)$ such
that

$$\|f_{A,P,\lambda_n} - f_{A,P}\|_H^2 \;\le\; c\,\lambda_n^{1/2} \qquad \forall n \ge n_0 .$$
Hence, if $\lim_{n\to\infty} a_n \lambda_n^{1/2} = 0$, it follows from Theorem 6.3 that

$$a_n \|f_{A,D_n,\lambda_n} - f_{A,P}\|_H^2 \;\longrightarrow\; 0 \qquad \text{in probability.}$$
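To see how a source condition of the form $f_{A,P} = A_0^* g_0$ keeps the regularization bias under control, a sequence-space analogue may help. The sketch below is not the setting of the theorem: it swaps the loss used here for the squared loss, assumes noise-free data, and represents the operator by hypothetical singular values `sigmas`, so that the regularized coefficients have the closed form $\sigma_j^2/(\sigma_j^2+\lambda)\cdot f_j$. In that simplified setting the squared bias is at most $\tfrac{\lambda}{4}\|g\|^2$, because $\sigma/(\sigma^2+\lambda) \le 1/(2\sqrt{\lambda})$ for every $\sigma \ge 0$; the exponent differs from the one in the proof because the loss differs. All names and numbers are illustrative.

```python
def tikhonov_bias_sq(sigmas, g_coeffs, lam):
    """Squared bias of Tikhonov regularization in a diagonal model.

    Assumes squared loss, noise-free data, and the source condition
    f_j = sigma_j * g_j (f = A^* g in the basis of singular functions).
    """
    total = 0.0
    for sigma, g in zip(sigmas, g_coeffs):
        f_j = sigma * g                                   # coefficient of f
        f_lam_j = sigma ** 2 / (sigma ** 2 + lam) * f_j   # regularized coefficient
        total += (f_lam_j - f_j) ** 2
    return total


def source_condition_bound(g_coeffs, lam):
    """Upper bound (lam / 4) * ||g||^2 on the squared bias."""
    return 0.25 * lam * sum(g * g for g in g_coeffs)


if __name__ == "__main__":
    # Hypothetical singular values and coefficients, chosen only for illustration.
    sigmas = [2.0 ** (-j) for j in range(1, 21)]
    g_coeffs = [1.0 / j for j in range(1, 21)]
    for lam in (1.0, 1e-1, 1e-2, 1e-3, 1e-4):
        print(lam, tikhonov_bias_sq(sigmas, g_coeffs, lam),
              source_condition_bound(g_coeffs, lam))
```

Without the source condition no such uniform bound is available, since the factor $\lambda/(\sigma_j^2+\lambda)$ multiplying $f_j$ tends to one as $\sigma_j \to 0$.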



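Next, let Assumption (41) be fulfilled. Then there is a $g \in L_2(P_{Z_0})$ such that

$$A_0^* g = \sum_{j=1}^{\infty} v_j(x) \cdot v_j = \sum_{j=1}^{\infty} \langle v_j, \Phi(x) \rangle_H \, v_j = \Phi(x)$$

where we have used in the last equality that $\{v_j \mid j \in \mathbb{N}\}$ is a complete
orthonormal system of $H$. Hence,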

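$$\begin{aligned}
f_{A,P,\lambda_n}(x) - f_{A,P}(x)
&= \langle f_{A,P,\lambda_n} - f_{A,P}, \Phi(x) \rangle_H
 = \langle f_{A,P,\lambda_n} - f_{A,P}, A_0^* g \rangle_H \\
&= \langle A_0 f_{A,P,\lambda_n} - A_0 f_{A,P}, g \rangle_{L_2(P_{Z_0})}
\end{aligned}$$

and, accordingly, it follows from (88) that there is a constant $c \in (0,\infty)$
such that

$$\bigl(f_{A,P,\lambda_n}(x) - f_{A,P}(x)\bigr)^2 \;\le\; c\,\lambda_n \qquad \forall n \ge n_0 .$$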



Hence, if $\lim_{n\to\infty} a_n \lambda_n = 0$, it follows from Theorem 6.3 and the
reproducing property (10) that

$$a_n \bigl(f_{A,D_n,\lambda_n}(x) - f_{A,P}(x)\bigr)^2 \;\longrightarrow\; 0 \qquad \text{in probability.}$$

□